FOREWORD


On  October  3-4,  1984,  the  Defense  Advanced  Research  Projects  Agency, 
Information  Processing  Techniques  Office  held  a  workshop  on  Image  Understanding. 
This  workshop  was  the  fifteenth  in  the  series  which  have  been  conducted  by  DARPA 
since  1975.  The  purpose  of  the  workshop  was  to  bring  together  the  research 
community  and  the  user  community  so  that  each  could  benefit  from  the  interaction 
and,  at  the  same  time,  promote  a  synergism  which  would  improve  the  efforts  of  the 
individual  participants  as  well  as  the  communities  as  a  whole.  Toward  this  end, 
each  Principal  Investigator  under  the  DARPA  program  reviewed  progress  during  the 
past  year  and  research  personnel  presented  specific  technical  details  of  selected 
facets  of  their  research  work.  This  Proceedings  incorporates  the  P.I.  reports  and
the  technical  reports  presented  at  the  workshop  and,  in  the  interests  of  providing  a 
total  record  of  the  research  program,  includes  copies  of  those  technical  papers 
which  were  not  presented  due  to  the  press  of  time. 

Commander  Ronald  B.  Ohlander,  Assistant  Director  for  Computer  Science  for 
the  DARPA/IPTO,  and  the  program  manager  for  the  Image  Understanding  research 
program  chose  as  a  theme  for  this  year's  workshop  "Future  Directions  for  I.U."  The 
community  should,  he  observed,  reassess  itself  and  take  a  close  look  at  where  it  is 
headed  in  the  next  five  years.  Commander  Ohlander  believes  that  we  should  be 
concerned  with  new  architectures  for  high-level  vision  processing,  the  use  of 
techniques  from  other  areas  of  AI,  such  as  expert  systems  technology,  and  major 
thrusts  in  modeling  and  representation  systems.  With  this  in  mind,  a  feature  of  this 
year's  workshop  was  the  inclusion  of  a  special  panel  discussion  session  to  air  various
views  on  the  future  directions  for  Image  Understanding,  and  an  overview  by 
Commander  Ohlander  of  the  I.U.  portions  of  the  Strategic  Computing  Program,  one 
of  the  major  technology  thrusts  being  undertaken  by  the  Defense  Advanced 
Research  Projects  Agency. 

This  proceedings  has  been  supplied  to  the  Defense  Technical  Information 
Center  (DTIC)  and  copies  may  be  secured  from  that  Agency  by  writing  to  the
following  address: 


Defense  Technical  Information  Center 
Cameron  Station,  Bldg.  #5 
Alexandria,  VA  22314 

A  small  charge  is  assessed  by  the  DTIC  for  reproduction  expenses.  Accession 
number  for  this  proceedings  is  not  yet  available  but  will  be  assigned  by  the  DTIC 
within  the  next  thirty  days.  Accession  numbers  for  previous  issues  are  listed  on  the 
following  page. 




The  images  chosen  for  the  cover  design  of  this  proceedings  were  provided  by 
the  Computer  and  Information  Science  Department,  Lederle  Graduate  Research 
Center,  University  of  Massachusetts  at  Amherst.  The  images  were  essentially 
taken  from  Daryl  Lawton's  thesis*  and  are  meant  to  illustrate  the  type  of  work  that 
can  be  obtained  from  motion  processing.  The  notes  accompanying  the  images
describe  the  process  as  follows:

The  three  photos  shown  on  the  left  side  represent  successive  frames  of  an 
image  sequence  taken  from  a  vehicle  moving  along  a  road.  The  images  show  a 
road  sign  in  the  foreground,  a  telephone  pole  in  the  mid-ground,  and  a 
background  of  tree  foliage. 

In  the  center,  the  displacement  vectors  show  the  zero-crossing  contours  of  the 
Laplacian  of  the  images  in  gray.  Superimposed  on  these,  in  white,  are  the
locations  of  interest  points  along  these  contours.  The  lines  extending  from 
these  points  show  the  displacement  of  these  interest  points  through  subsequent 
frames  along  radial  flow  lines  emanating  from  the  focus  of  expansion  of  the 
optic  flow,  which  is  found  by  an  optimizing  search  procedure. 

Once  the  focus  of  expansion  has  been  determined,  displacement  vectors  can  be 
computed  along  almost  all  contours,  even  those  on  which  there  are  no  interest 
points.  From  these  displacements,  the  environmental  depth  of  the  contours 
can  be  calculated.  For  the  depth  segmentation,  as  shown  in  the  three  right 
hand  photos,  the  contours  are  separated  into  intervals  of  environmental 
depth.  This  fairly  well  extracts  the  road  sign,  the  telephone  pole,  and  the 
background  foliage,  which  lie  at  distinct  ranges  of  increasing  depth  from  the 
observer. 

As  usual,  the  artwork  and  layout  for  the  Proceedings  cover  is  the  work  of  Mr.
Tom  Dickerson  of  Science  Applications  International  Corporation.  Appreciation  is
also  due  to  Ms.  Lori  Beth  DeFuria  who  handled  all  the  mailings  and  Ms.  Barbara
Burkett  and  Ms.  Barbara  Ashooh  who  contributed  typing  support  in  putting  together 
the  proceedings. 


Lee  S.  Baumann 
Science  Applications 
International  Corporation 
Workshop  Organizer 


*  Daryl  T.  Lawton,  "Processing  Dynamic  Image  Sequences  from  a  Moving  Sensor," 
Ph.D.  Dissertation  and  COINS  Technical  Report  84-05,  University  of  Massachusetts
at  Amherst,  February  1984. 

*  Daryl  T.  Lawton,  "Processing  Restricted  Sensor  Motion",  Proceedings:  Image 
Understanding  Workshop,  June  1983,  pp.  266-281.
IMAGE  UNDERSTANDING  RESEARCH
AT  THE  UNIVERSITY  OF  MASSACHUSETTS 

Edward  M.  Riseman  and  Allen  R.  Hanson

Computer  and  Information  Science  Department 
University  of  Massachusetts 
Amherst,  MA  01003 


ABSTRACT 

Our DARPA-funded research program continues to focus on
dynamic image processing. The work includes the hierarchical
computation of more accurate displacement fields in the
presence of occlusion, the computation of general motion
parameters of the sensor and independently moving objects,
and architectures which will allow real-time image
interpretation. Closely related work involves a new low-level
algorithm for extracting straight lines, even those with very
low contrast, from natural scenes, and the development of
a knowledge-based system for photointerpretation of aerial
images.

1.  MOTION  PROCESSING  FOR 
RECOVERY  OF  ENVIRONMENTAL  DEPTH 

We continue to explore and evolve several classes of
motion algorithms for processing image sequences from a
sensor moving through an environment. The major goal
in motion processing is the recovery of the motion parameters
of the sensor and each independently moving object.
The computation of environmental depth of visible surfaces
follows in a rather straightforward manner.

In the first group, we have already reported Lawton's
research [LAW83, LAW84] on a class of algorithms for
restricted cases of sensor motion through a static environment;
of particular interest are pure translational processing
and motion restricted to a known plane. This algorithm
combines the computation of the image displacements forming
the flow field with the computation of the sensor motion
parameters. It avoids many of the errors produced by
noise, ambiguity, and occlusion when, as usually is the case,
the flow field is computed prior to inference of the motion
parameters. Current work on these algorithms is directed
towards a more careful analysis of the accuracy, efficiency,
and robustness of this approach. The performance of the
various algorithms is being quantified on a larger set of
images. We intend to study the sensitivity of the algorithm
to the number of feature points employed, the amount of
texture exhibited in the scene, the orientation of the sensor
relative to the direction of motion, the rate of convergence
to a solution, and the accuracy of the final depth map.

1 This work has been supported primarily by the Defense
Advanced Research Projects Agency under Contract
N00014-82-K-0464. Some of the work reported here has been
supported by the AFOSR under Contract F49620-83-C-0099
and by RADC under Task 1-4-0055.

A second major approach to motion, involving the analysis
of general sensor motion in an environment containing
independently moving objects, will be documented in the
forthcoming thesis of Adiv [ADI84a]. This motion algorithm
first approximates the visual field as rigid motion of
a set of planar surfaces, groups them into rigidly moving
objects, and then drops the planarity assumption to infer
the 3D motion and depth of each of these segments. Let us
consider the issues in a little more detail.

The most common approach for the analysis of visual
motion is based on two phases: computation of an optical
flow field and interpretation of this field. A major problem
which has emerged in this research area is sensitivity
to noise. Flow fields generated by existing techniques are
noisy and partially incorrect, especially near occlusion or
motion boundaries. Most of the algorithms for interpreting
these fields cannot successfully deal with realistic levels of
noise. Global approaches, which utilize all the available
information, can be expected to be relatively robust. Still, an
inadequate choice of an optimization criterion often limits
the performance of these techniques. Furthermore, the
presence of independently moving objects usually makes such
global techniques impractical.

These two issues, the presence of noise and the presence
of independently moving objects, are addressed by a
new scheme for interpreting optical flow fields, which are
allowed to be either dense or sparse [ADI84b]. In the first
stage of this scheme the flow field is segmented into connected
sets of flow vectors, where each set is consistent with
a rigid motion of a roughly planar surface and, therefore,
is likely to be associated with only one rigid object. The
algorithm for achieving such a segmentation is based on a
modification of the generalized Hough technique, in which
flow vectors are grouped into components consistent with
affine transformations of the image. If appropriate, components
are then merged to create segments.
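
As a rough illustration of what it means for a set of flow vectors to be
"consistent with an affine transformation," the sketch below fits a
six-parameter affine flow model to a set of vectors by ordinary least
squares and reports the root-mean-square residual. It is not the modified
generalized Hough procedure of [ADI84b]; the function name, the
parameterization, and the sample data are assumptions made for
illustration only.

```python
import numpy as np

def fit_affine_flow(points, flows):
    """Least-squares fit of u = a0 + a1*x + a2*y and v = b0 + b1*x + b2*y.

    points: (N, 2) array of image positions (x, y).
    flows:  (N, 2) array of flow vectors (u, v) at those positions.
    Returns a (3, 2) parameter matrix and the RMS residual of the fit.
    """
    points = np.asarray(points, dtype=float)
    flows = np.asarray(flows, dtype=float)
    # Each row of the design matrix is [1, x, y].
    design = np.column_stack([np.ones(len(points)), points[:, 0], points[:, 1]])
    params, _, _, _ = np.linalg.lstsq(design, flows, rcond=None)
    residual = flows - design @ params
    rms = float(np.sqrt(np.mean(np.sum(residual ** 2, axis=1))))
    return params, rms

# Hypothetical data: five points whose flow is exactly affine.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
true_params = np.array([[0.5, -0.2], [0.1, 0.0], [0.0, 0.1]])
flw = np.column_stack([np.ones(len(pts)), pts]) @ true_params
params, rms = fit_affine_flow(pts, flw)
```

A small residual suggests that the vectors could belong to one component; a
large residual suggests that they span more than one surface or motion.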
In the second stage of the proposed scheme sets of segments
are hypothesized to be induced by the same rigidly
moving object. Each of these hypotheses is tested by searching
for 3-D motion parameters which are compatible with all
the segments in the corresponding set. This search is based
on a least-squares technique which minimizes the deviation
between the given flow field and that predicted from the
computed parameters. Once the motion parameters are
recovered, the relative environmental depth can be estimated.

This technique of segmenting the flow field and then
combining segments to form object hypotheses makes it
possible to deal with independently moving objects while
employing all the available information associated with each
object. In addition, the optimization criterion for recovering
3-D information is appropriate for dealing with partially
incorrect flow fields. Thus, the proposed scheme is relatively
insensitive to noise. Experiments, based on real and simulated
data, involve flow fields which are noisy and partially
incorrect, especially near occlusion boundaries [ADI84b].
The successful results demonstrate the effectiveness of the
scheme in such situations. There are, however, inherent
ambiguities in the interpretation of noisy flow fields. These
ambiguities will be analyzed and demonstrated in [ADI84a].

2.  COMPUTING  ACCURATE  DISPLACEMENT
FIELDS  IN  THE  PRESENCE  OF  OCCLUSION

One method for computing displacement fields prior
to depth and motion parameter computations is by area
correlation. However, the displacement fields produced by
this method are often incorrect in homogeneous areas of
the image and near occlusion boundaries (where an area
is visible in one frame and occluded in the next). Anandan
[ANA84a,b] has developed a matching algorithm which
overcomes many of the problems with current techniques.
The algorithm is obtained by modifying the search strategy
employed in the hierarchical algorithm of Glaser, Reynolds,
and Anandan [GLA83] to take into account the reliability of
each displacement vector as provided by a confidence measure.
The result is a computationally efficient matching
algorithm which provides a dense displacement field with
estimates of the reliability of each displacement vector.

The matching algorithm developed by Glaser et al. is a
hierarchical coarse-fine strategy using band-pass filtered
images in which matches found at the coarse (low-frequency)
levels direct the search for matches at the finer
(higher-frequency) levels. In this algorithm, each pixel at a
coarse level covers a large area at the finest level; errors at
the coarse level cause incorrect initial estimates to be
generated for the finer levels and hence the search at the finer
levels is carried out in areas which do not include the correct
match. The confidence measure attempts to detect these and
similar situations so that the search strategy can be modified
accordingly.

The confidence measure used is based on the shape
of the SSD surface generated by the values of the
sum-of-squared-differences corresponding to different candidate
displacements from the pixel. For example, the SSD surface
tends to have a rather flat behavior in homogeneous areas
of the image as opposed to a much sharper valley in areas
where there is distinct image structure. The confidence
measure is computed as the minimum of the normalized
second derivatives of the SSD surface computed at 0, 45,
90, and 135 degrees by a 1x3 Laplacian operator centered
at the point of best match. This measure tends to be low in
homogeneous and occluded areas, and along edges; hence
it is useful in isolating the areas in an image where the
displacement estimates are unreliable and often incorrect.
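
The following minimal sketch shows one way such a measure could be computed.
It assumes that the "1x3 Laplacian operator" is the second-difference
kernel [1, -2, 1] applied along each of the four directions, and that the
best match lies in the interior of the SSD surface. The text does not say
how the second derivatives are normalized, so that step is omitted here;
the function name and the test surfaces are illustrative assumptions.

```python
import numpy as np

def ssd_confidence(ssd, row, col):
    """Minimum directional second difference of an SSD surface at (row, col).

    The point (row, col) must not lie on the border of the array.
    """
    # Unit steps (d_row, d_col) for the 0, 45, 90 and 135 degree directions.
    directions = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]
    center = ssd[row, col]
    curvatures = []
    for dr, dc in directions:
        # [1, -2, 1] kernel applied along this direction.
        second_diff = ssd[row + dr, col + dc] - 2.0 * center + ssd[row - dr, col - dc]
        curvatures.append(second_diff)
    return min(curvatures)

# Hypothetical SSD surfaces with the best match at the center (2, 2).
rr, cc = np.mgrid[0:5, 0:5]
sharp_valley = ((rr - 2) ** 2 + (cc - 2) ** 2).astype(float)  # distinct structure
edge_like = ((cc - 2) ** 2).astype(float)                     # flat along one direction
flat = np.zeros((5, 5))                                       # homogeneous area

conf_sharp = ssd_confidence(sharp_valley, 2, 2)
conf_edge = ssd_confidence(edge_like, 2, 2)
conf_flat = ssd_confidence(flat, 2, 2)
```

Taking the minimum over directions is what makes the measure low along an
edge: the surface is curved across the edge but flat along it.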

The original search strategy described by Glaser et al.
[GLA83] started at a coarse level at which all image
displacements were at most one pixel. Matches found in a 3x3
window were projected to the four pixels at the next finer
level of resolution and a search conducted in a 3x3 window
around these estimates. This process continued down to the
finest level of resolution. Anandan modifies this search
strategy in two ways:

1.  Only high confidence coarse estimates are projected.
The motivation for this is that when incorrect coarse
estimates are projected, the 3x3 searches at the finer
levels are conducted in areas of the second frame that
do not include the true match point, which in turn
causes incorrect searches at all subsequent levels.

2.  Estimates may be projected to an area larger than the
four descendant pixels. If the incorrect coarse level
projections are eliminated, then the finer level search
can be conducted over a larger area and the true match
can perhaps be recovered.
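
The first modification can be sketched as follows. The code assumes a factor
of two in resolution between levels, so that a coarse displacement is
doubled when it is passed to its four descendant pixels; the threshold, the
array layout, and the use of NaN to mark "no estimate projected" are
assumptions of this sketch rather than details given above.

```python
import numpy as np

def project_estimates(coarse_disp, coarse_conf, threshold):
    """Project high-confidence coarse displacements to the next finer level.

    coarse_disp: (H, W, 2) displacement estimates at the coarse level.
    coarse_conf: (H, W) confidence values for those estimates.
    Returns a (2H, 2W, 2) array; entries stay NaN where nothing was projected.
    """
    h, w, _ = coarse_disp.shape
    fine = np.full((2 * h, 2 * w, 2), np.nan)
    keep = coarse_conf >= threshold
    doubled = 2.0 * coarse_disp  # displacements scale with resolution
    for dr in (0, 1):
        for dc in (0, 1):
            # Strided view onto one of the four descendants of every coarse pixel.
            view = fine[dr::2, dc::2]
            view[keep] = doubled[keep]
    return fine

# Hypothetical 1x2 coarse level: the left estimate is reliable, the right is not.
coarse_disp = np.array([[[1.0, 0.0], [0.0, 1.0]]])
coarse_conf = np.array([[0.9, 0.1]])
fine = project_estimates(coarse_disp, coarse_conf, threshold=0.5)
```

Fine-level pixels left without an estimate would then be candidates for the
wider search described in the second modification.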

The modified algorithm appears to work well in complex
image pairs. Future work involves using the confidence
measure and other attributes of the SSD surface to distinguish
between occluded and homogeneous areas and to use
directional information in the SSD surface to modify the
expansion of the search area in a direction perpendicular to
an edge.

3.  EXTRACTING  STRAIGHT  LINES 

The organization of local intensity changes into the
more global abstractions called 'lines' or 'boundaries' is
an early and important step in the transformation of the
visual signal into intermediate constructs useful for
interpretation processes. A recent algorithm developed by Burns
[BUR84a,b,c] uses gradient orientation, rather than changes
in image intensity, as the initial organizing feature. The
general approach consists of four basic steps:

1)  Group pixels into line-support regions based on
similarity of gradient orientation. This allows data-directed
organization of edge contexts without commitment to
masks of a particular size.

2)  Approximate the intensity surface by a planar surface.
The planar fit is weighted by the gradient magnitude
associated with the pixels so that intensities in the
steepest part of the edge will dominate.

3)  Extract attributes from the line-support region and
the planar fit. The attributes extracted include the
representative line and its length, contrast, width,
location, orientation, and straightness.

4)  Filter lines on the attributes to isolate various image
events such as long straight lines of any contrast; high
contrast short lines (heavy texture); low contrast short
lines (light texture); homogeneous regions of adjacent
very low contrast lines; and lines at particular
orientations and positions.

Gradient orientation at a pixel is estimated from the
horizontal and vertical components of the gradient obtained
from two 2x2 masks. These estimates are then grouped
into regions by using two overlapping sets of partitions of
fixed size, say two 45° sets staggered by 22.5° or two 22.5°
sets staggered by 11.25°. Each set produces a region
segmentation of the gradient image and then these segmentations
are merged by choosing that region satisfying local
constraints. Each resulting 'line-support region' is a candidate
area for a straight line since the local gradient estimates
share a common orientation. A representative line
can be extracted by intersecting a least squares planar
estimate of the underlying intensity surface with the plane
corresponding to the average region intensity.
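
The orientation estimate and the two staggered partitions can be sketched
as below. The particular 2x2 difference masks are an assumption, since the
text does not give them, and the function names are hypothetical. The
merging of the two segmentations and the planar fit are not shown.

```python
import numpy as np

def gradient_orientation(image):
    """Gradient orientation in degrees, in [0, 360), from 2x2 difference masks."""
    img = np.asarray(image, dtype=float)
    a, b = img[:-1, :-1], img[:-1, 1:]   # top-left, top-right of each 2x2 block
    c, d = img[1:, :-1], img[1:, 1:]     # bottom-left, bottom-right
    gx = (b + d) - (a + c)               # horizontal component
    gy = (c + d) - (a + b)               # vertical component
    return np.degrees(np.arctan2(gy, gx)) % 360.0

def partition_labels(orientation, width=45.0, offset=0.0):
    """Index of the fixed-width orientation bucket, for buckets starting at offset."""
    return np.floor(((orientation - offset) % 360.0) / width).astype(int)

# Two nearby orientations that straddle a bucket boundary of the first set.
theta = np.array([44.0, 46.0])
first_set = partition_labels(theta, width=45.0, offset=0.0)
staggered_set = partition_labels(theta, width=45.0, offset=22.5)

# A horizontal intensity ramp has the same gradient orientation everywhere.
ramp = np.tile(np.arange(4.0), (4, 1))
ramp_orientation = gradient_orientation(ramp)
```

The example shows why two overlapping sets are useful: orientations of 44°
and 46° are split by the first partition but fall into the same bucket of
the staggered one.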

A set of attributes is extracted from the line-support
region and its associated line. These attributes include line
length, contrast, width, steepness, straightness (or deviation
from straightness), and various orientation and position
parameters. They form the basis of a line data base for
an image over which various filtering operations can be
performed in order to extract lines with specific properties. For
example, in many of our images long high contrast lines
correspond to significant boundaries, medium contrast short
lines may correspond to textured areas, and low contrast
wide lines to slow intensity gradients.

The algorithm does an excellent job of extracting straight
lines, including very low contrast lines. It represents all lines
as straight lines and we are examining ways in which the
same general technique could be used to generate curvilinear
lines and boundaries. The gradient orientation grouping
operator does a credible job for such a simple technique and
we are looking at methods for detecting and overcoming the
overgrouping and other errors it sometimes produces.
Finally, we are examining a variety of line descriptors and
are incorporating them into other interpretation processes
(e.g., see [REY84b] in this proceedings).

4.  INTERPRETATION  OF
AERIAL  PHOTOGRAPHS

The detailed examination, segmentation, and understanding
of high resolution digital images represents a severe
computational load for current computers. One technique
for reducing the overall computational requirements
involves selectively focussing on relevant portions of an
image and ignoring irrelevant portions. The specification of
relevancy implies some external model which represents a
description of those areas or objects that are of potential
interest and to which computational resources may most
fruitfully be applied. The most suitable method for applying
such selective processing to high resolution imagery
is the multi-resolution, or pyramid, technique. From the
original, large-scale, full resolution image is constructed a
progression of smaller and smaller images, each covering the
same extent, but at successively coarser resolution.
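
A minimal sketch of such a pyramid is given below. It assumes that each
coarser level is formed by averaging non-overlapping 2x2 blocks of the
level beneath it; the text does not specify the reduction operator, and
the function name is hypothetical.

```python
import numpy as np

def build_pyramid(image, levels):
    """Return [full resolution, half resolution, ...] with `levels` entries.

    Each coarser level averages non-overlapping 2x2 blocks of the previous one,
    so every level covers the same extent of the scene.
    """
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        top = pyramid[-1]
        # Trim an odd trailing row or column so the image tiles into 2x2 blocks.
        h, w = top.shape[0] // 2 * 2, top.shape[1] // 2 * 2
        blocks = top[:h, :w].reshape(h // 2, 2, w // 2, 2)
        pyramid.append(blocks.mean(axis=(1, 3)))
    return pyramid

# Hypothetical 8x8 image reduced to 4x4 and then 2x2.
image = np.arange(64, dtype=float).reshape(8, 8)
pyramid = build_pyramid(image, levels=3)
shapes = [level.shape for level in pyramid]
```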

In this section we describe some recent experiments
using a hierarchical segmentation algorithm and focus of
attention mechanism for locating buildings, roads, and
airports in a high-resolution monochromatic aerial image. The
approach involves formulating segmentation and feature
extraction algorithms as hierarchical algorithms within the
processing cone [HAN80]. The focus of this section is on
the segmentation processes; more complete interpretation
results may be found in [REY84a,b].

The general idea is to use the Nagin-Kohler localized
histogram region segmentation algorithm [NAG79, KOH83]
and the Burns straight line extraction algorithm as the
primary low-level processes used to drive the bottom-up
component of a hierarchical localized segmentation process.
The feature extraction process yields a low-level
representation of the data and an evidential-based inference net is
used to transform this data to an intermediate level of
representation within long-term memory. This intermediate
level of representation in turn allows the multiresolution
segmentation algorithms to be focussed and selectively
applied to areas of interest in the image. We are investigating
the effectiveness of directing the system to look only
in areas where a coarse level segmentation yields a hypothesis
that an object of the sort we are looking for exists.
For high-resolution imagery, the computational advantage
of this approach is significant. For example, if we assume
that even 2/3 of the possible sectors are used at each level,
then only about 1/5 of the image is being examined 4 levels
down. In the case that only 1/4 of the sectors are selected, only
1/256th of the image is being searched 4 levels down. Thus
the computational complexity of the process can be kept
within reasonable bounds.
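
The two figures quoted above follow from raising the per-level selection
fraction to the fourth power, on the assumption that the same fraction of
sectors is kept independently at each of the four levels. A short check
(the function name is hypothetical):

```python
from fractions import Fraction

def fraction_examined(selected_per_level, levels):
    """Fraction of the full image still under examination after `levels` levels."""
    return Fraction(selected_per_level) ** levels

two_thirds = fraction_examined(Fraction(2, 3), 4)   # 16/81, roughly one fifth
one_quarter = fraction_examined(Fraction(1, 4), 4)  # 1/256
```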

The selection of candidate regions for examination at
a higher resolution is accomplished by choosing all regions
which satisfy a set of object-dependent constraints on
region and line attributes. In general, the results from such
a simple rule will be unreliable and prone to error. Image
interpretation is implicitly involved with the problems
associated with combining information from multiple sources
of knowledge. Any perceptual system which utilizes
processed sensory data must recognize the fact that to varying
degrees the information will be imperfect and prone to error.
With this in mind we are developing mechanisms for
evidential reasoning and inferencing under uncertainty [LOW82,
WES82] in order to construct more reliable focussing sets.

Some of the limitations of inferencing using Bayesian
probability models are overcome using the Dempster-Shafer
formalism for evidential reasoning, in which an explicit
representation of partial ignorance is provided [SHA76]. The
inferencing model allows 'belief' or 'confidence' in a
proposition to be represented as a range within the [0,1] interval.
The lower and upper bounds represent support and
plausibility, respectively, of a proposition, while the width of the
interval can be interpreted as ignorance. Wesley [WES83]
is extending this approach to the problem of distributed
control of a set of knowledge sources which can be applied
to examine particular concepts in long-term memory.
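
To make the support and plausibility bounds concrete, the sketch below
implements Dempster's rule of combination for two bodies of evidence over
a small frame of discernment. The frame, the mass assignments, and the
function names are hypothetical and chosen only for illustration; they are
not taken from the system described here.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for mass functions keyed by frozenset."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalize so that the remaining masses sum to one.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

def support(m, hypothesis):
    """Lower bound: total mass committed to subsets of the hypothesis."""
    return sum(w for s, w in m.items() if s <= hypothesis)

def plausibility(m, hypothesis):
    """Upper bound: total mass not committed against the hypothesis."""
    return sum(w for s, w in m.items() if s & hypothesis)

frame = frozenset({"road", "building", "field"})
# Hypothetical evidence from a region rule and from a line rule.
m_region = {frozenset({"road"}): 0.6, frame: 0.4}
m_line = {frozenset({"road", "building"}): 0.7, frame: 0.3}

m = combine(m_region, m_line)
road = frozenset({"road"})
building = frozenset({"building"})
road_interval = (support(m, road), plausibility(m, road))
building_interval = (support(m, building), plausibility(m, building))
```

Mass assigned to the whole frame represents ignorance; it widens the
interval for every proposition without committing belief to any of them.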

Once the inference network is fully integrated, we expect
the hierarchical segmentation and interpretation process
to operate as follows. First the local histogram-based
region segmentation and the linear feature extraction
algorithm are applied at a coarse level of resolution. Knowledge
sources in the form of object hypothesis rules are then
applied to region and line attributes at that level. The output
of these rules is converted to a form appropriate for input
into the inference network of long-term memory. The
inferencing process is then invoked and each region yields a
support and plausibility (i.e., a range of belief) that it is
a candidate region for one of the goal objects. The region
and line segmentation algorithms are then applied at a finer
level of resolution, but only on the candidate regions which
have high support. At this finer level of resolution the
representation of the object is of a different form and may involve
more expensive object rule combinations of the region and
line attributes, but applied only to a small subset of the
image. The process is recursively applied to finer levels of
resolution.

5.  CONTENT  ADDRESSABLE  ARRAY
PARALLEL  PROCESSOR  (CAAPP)

For the last several years there has been a synergistic
relationship between our machine vision group and a
parallel architecture group led by Professor Caxton Foster
[WEE84, LEV84]. We have designed and are now proposing
the construction of a large scale Content Addressable Array
Parallel Processor (CAAPP) for low, medium and high
level vision processing. This new architecture combines
associative processing via global broadcast and response to
and from an array of cells, and array processing via local
cellular square neighborhood computation. The resulting
architecture allows simple solutions to many problems that
are difficult for parallel machines which provide only one of
these capabilities [WEE83].

The prototype CAAPP would consist of sixteen thousand
processing elements arranged as a 128 x 128 square
array. We have taken a pragmatic view of VLSI technology
because we wish to actually construct this machine. Hence,
we have approached the design in a very conservative manner,
making use only of existing technology (NMOS),
to ensure rapid and successful development. This design is
intended to be expandable up to at least a quarter of a million
processing elements in a 512 x 512 array. The initial
16K processor parallel machine will have an effective operating
speed several hundred to a thousand times that of the
fastest sequential processor available today. The CAAPP
would be connected via its own controller to a VAX-11/780,
LISP machine, or some other general purpose computing
machine which would provide both the algorithm development
environment and the operating environment for the
system.

The CAAPP is capable of providing an intermediate
symbolic representation by storing the results of low level
vision algorithms and providing the communication interface
to knowledge-based processing [FOS84]. In addition
to its two dimensional communication pathways between
neighboring cells and bit-serial local cellular computation,
this device would be capable of content addressable memory
(associative) functions including broadcast instruction,
find-first-responder, and count-responders. These operations
permit the feedback loop to be closed between high-level
processing and low level processing by allowing communication
and control information to flow up and down
between the different representations supported by the
architecture.
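
The three associative functions named above can be imitated in software on
an ordinary array, as in the sketch below. This is only a functional
analogy and says nothing about the CAAPP hardware or its instruction set;
the row-major ordering used to define the "first" responder, the function
names, and the sample cell contents are assumptions.

```python
import numpy as np

def broadcast_compare(cells, value):
    """Broadcast a comparison to every cell; each cell sets a response bit."""
    return cells == value

def count_responders(responders):
    """Number of cells whose response bit is set."""
    return int(responders.sum())

def find_first_responder(responders):
    """(row, col) of the first responding cell in row-major order, or None."""
    if not responders.any():
        return None
    flat_index = int(np.argmax(responders))
    return divmod(flat_index, responders.shape[1])

# Hypothetical 3x3 array of cell contents, e.g. region labels.
cells = np.array([[3, 7, 7],
                  [1, 7, 2],
                  [9, 4, 7]])

responders = broadcast_compare(cells, 7)
count = count_responders(responders)
first = find_first_responder(responders)
none_found = find_first_responder(broadcast_compare(cells, 5))
```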

The intermediate level of representation provides an
interface between the low and high levels of representation,
that is, between pixel-based representation and symbolic
elements representing visual knowledge stored in a database.
In the UMASS VISIONS system, the intermediate level
consists of a symbolic description of the two dimensional image
in terms of regions and line segments (that are still in
registration with the raw image data) as well as their associated
attributes which can be used in the interpretation process.
In some systems this level would consist of representations
of surfaces, or more generally, 'intrinsic' features of the
physical environment.

Intermediate processing includes several kinds of activities.
First is the set of bottom-up tasks which are needed
to complete the intermediate level of representation. This
includes the extraction of the features for regions, lines, and
vertices as well as the relations between these entities. The
second group of intermediate processing activities involves
grouping, splitting, and labelling processes, in either
data-directed or knowledge-directed modes (i.e., bottom-up or
top-down) to form intermediate events which more naturally
match stored object descriptions.

We  believe  that  the  key  to  vision  procetring  it  a  flow 
of  communication  and  control  both  up  and  down  through  all 
repretentation  levels.  Communication  between  these  levels 
is  by  no  means  unidirectional.  In  most  cases,  recognition  of 
an  object  or  part  of  a  scene  at  the  high  level  will  establish 
a  strategy  for  further  processing  and  probing  the  low  and 
intermediate  levels,  in  order  to  pull  out  additional  features 
under  the  guidance  of  a  partial  interpretation.  This  might 
involve joining together region, line, and surface information
to form a symbolic representation that would more easily
and naturally match a stored object description.
1.  Abstract 

The  Image  Understanding  Project  at  Columbia  continues  to 
center  its  efforts  on  basic  “middle-level”  vision  research:  the 
representations  and  algorithms  concerned  with  deriving 
surface  information  from  low-level  aggregate  cues.  The 
project  has  expanded  somewhat  to  cover  six  major  concerns 
in  image  understanding:  the  complexity  of  algorithms,  the 
theory  and  analysis  of  texture,  the  integration  of  systems, 
the development of research aids, and the exploitation of
two  parallel  architectures.  This  report  on  our  second  full 
year  summarizes  our  progress  in  each  of  these  areas;  we 
note  the  graduation  of  our  first  doctoral  student. 


2.  Introduction 

The  Image  Understanding  Project  at  Columbia  continues  a 
steady  growth.  We  have  augmented  our  hardware  and 
software base (Vax plus Grinnell, running melded CMU and
SRI  vision  libraries)  by  incorporating  an  image  processor 
board  and  by  implementing  a  flexible  convolution  package. 
Our  research  is  a  bit  wider  in  scope,  although  it  still 
emphasizes  that  level  of  image  understanding  which 
moderates  low-level  cues  into  surface  information;  we  have 
six  investigations. 

We  have  taken  the  information-centered  approach  to 
optimal  algorithms  and  applied  it  to  the  sparse  depth  data 
interpolation  task  (and  others),  with  surprising  results.  We 
are  extending  the  shape-from-texture  theory  by  investigating 
the  applicability  of  scale-space  texture  filtering  and  fractal 
texture  descriptions.  Work  continues  on  the  integration  of 
multiple  texture  algorithms  into  a  coherent  shape  analysis 
system.  We  have  pursued  the  development  of  various 
graphic  aids  for  the  image  understanding  experimenter.  For 
the Non-Von supercomputer, we have simulated and
critiqued a wide range of image understanding algorithms.
For the Grinnell image processor, we have written and
experimentally verified routines for highly efficient low-level
edge detection.


3.  The  Complexity  of  Image  Algorithms 

Under  independent  development  at  Columbia  is  a  theory  of 
computational  complexity  called  the  information-centered 
approach to optimal algorithms. It has proven surprisingly
successful  in  the  analysis  of  image  understanding  tasks.  We 
(David  Lee)  have  applied  it  to  the  problem  of  reconstructing 
a  surface  from  sparse  depth  information,  and  we  have  found 
that the problem is optimally solved, in constant time, by
spline interpolation; no adaptive algorithm would perform
better [8]. Since splines are easily computed by parallel
SIMD  algorithms,  we  foresee  the  possibility  of  very  simple 
interpolation  machines,  based  on  binocularity  or  range 
finding.  We  (Terry  Boult)  have  empirically  investigated  the 
computational properties of the splines on synthetic data,
and discovered encouraging regularities in their behavior which
we  hope  to  exploit  for  high  efficiency. 
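
As a small illustration of spline interpolation over sparse depth
samples, the following Python sketch fits a natural cubic spline to
depth values along a single scanline. It is a one-dimensional
simplification chosen for brevity, not the surface reconstruction
method of [8]; the function name and the choice of natural end
conditions are assumptions made here.

```python
def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least two samples.
    Hypothetical helper for illustration only.
    """
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    # m[i] is the second derivative at knot i; natural ends keep m[0] = m[-1] = 0.
    m = [0.0] * n
    if n > 2:
        # Tridiagonal system for the interior second derivatives.
        sub = [h[i - 1] for i in range(1, n - 1)]
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
        sup = [h[i] for i in range(1, n - 1)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n - 1)]
        # Thomas algorithm: forward elimination, then back substitution.
        for k in range(1, n - 2):
            w = sub[k] / diag[k - 1]
            diag[k] -= w * sup[k - 1]
            rhs[k] -= w * rhs[k - 1]
        sol = [0.0] * (n - 2)
        sol[-1] = rhs[-1] / diag[-1]
        for k in range(n - 4, -1, -1):
            sol[k] = (rhs[k] - sup[k] * sol[k + 1]) / diag[k]
        m[1:n - 1] = sol

    def evaluate(x):
        # Find the interval [xs[i], xs[i + 1]] containing x (end intervals extrapolate).
        i = 0
        while i < n - 2 and x > xs[i + 1]:
            i += 1
        a = xs[i + 1] - x
        b = x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate
```

Calling `natural_cubic_spline(positions, depths)` returns an evaluator
that reproduces the measured depths at the sample positions and gives
smooth values in between. Each evaluation touches only one interval,
which is one reason such schemes map easily onto parallel hardware.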

Preliminary  results  indicate  that  similar  mathematical 
approaches  to  the  problems  of  shape-from-shading  and  of 
optical  flow  may  result  in  similar  optimal  algorithms.  Thus, 
it  may  be  possible  to  put  a  fair  amount  of  low-level  image 
understanding algorithms on a unified theoretical footing.


4.  The  Analysis  of  Texture 

Many  of  the  algorithms  we  have  devised  for  our  middle- 
level  work  are  derived  from  a  central  methodological 
paradigm called “shape from texture” [9]. We have most
recently  applied  the  paradigm  in  two  additional  areas, 
deriving  further  surface  constraint  relations  and  procedures. 
We have shown how to exploit the gravitationally induced
environmental labels such as “horizontal” and “vertical” [6],
and how to exploit the assumptions of equality of linear
extents such as equal spacing or widths [7].

In  work  done  jointly  with  IBM,  we  (Paul  Douglas)  have 
begun  to  extend  the  theory  by  examining  the  textural 
residue  that  remains  after  a  surface  has  been  represented  in 
its  scale-space  filtered  form.  We  are  pursuing  several 
conjectures  about  the  fractal  dimension  of  this  residual 
signal,  relative  to  the  surface’s  perceived  orientation  and 
distance. In related work, we have explored the use of
simulated annealing as a control structure for the perception
of “emergent” visual phenomena such as subjective
groupings  of  textured  surfaces. 


5.  Integrated  Image  Systems 

Our  (Mark  Moerdler)  work  continues  on  a  middle-level 
vision  system  that  integrates  knowledge  about  surfaces  from 
multiple  independent  texture-based  sources.  Two  knowledge 
sources  have  been  refined,  based  on  texture  density  and 
texture spacing cues; they are tested for accuracy on several
synthetic (but noisy) images. The sources are beginning to
develop specialized heuristics for textural segmentation and
for the handling of degeneracies. Several other sources,
based  on  texture  orientation  and  textural  area,  have  been 
implemented  and  are  waiting  to  be  incorporated  into  the 
system. 


6.  Image  Research  Aids 

Image  understanding  systems  produce  vast  amounts  of 
complex  intermediate  data:  the  middle  levels  of  vision 
generate  multiple  assertions  about  any  underlying  surface. 
We  have  explored  some  graphic  aids  to  the  perception  of 
these  hypotheses  on  surface  orientation,  but  without  great 
success.  The  use  of  sequins  (circles  seen  in  perspective  so 
that  their  ellipsoidal  shapes  are  suggestive  of  underlying 
slant and tilt) appears fairly promising, however.

In  a  related  effort,  we  (David  Freudenstein)  are  in  the 
beginnings  of  the  construction  of  an  environment  for  the 
easy  intermingling  of  images  and  textual  information,  for 
communication,  documentation,  and  publication  purposes. 


7.  Image  Understanding  on  Non-Von 

Under  independent  development  at  Columbia  is  the  Non- 
Von  supercomputer,  one  of  a  class  of  fine-grained  tree- 
structured  machines  built  of  custom  VLSI  chips.  We 
(Hussein Ibrahim, in his thesis [2]) have found that such
architectures  lend  themselves  easily  and  naturally  to  many 
low-level  vision  algorithms,  and,  with  some  care,  to  middle- 
level  vision  algorithms  as  well. 


Vision algorithms were selected to span different levels of
computer vision tasks [3]. They included image correlation,
histogramming, connected component labeling [4], geometric
property computations, set operations, the Hough transform
method for detecting object boundaries [5], and the
correspondence methods for moving light displays. The
encoding of the algorithms incorporated novel approaches to
reduce the effects of the communication bottleneck usually
associated with tree architectures.

Performance was studied using two simulators: all
algorithms were simulated on a functional level simulator,
and some were also simulated on a machine instruction level
simulator. In general, we have found the performance of
this class of tree machines to be superior to other highly
parallel architectures for image understanding problems; we
have also identified the limitations of these architectures,
and have proposed methods to ameliorate them.

In  other  related  work  done  jointly  with  Bell  Laboratories, 
we  (Marcia  Derr)  have  explored  the  nearly  opposite  problem 
of  distributing  visual  computation  over  a  very  coarse-grained 
multiprocessor:  several  Motorola  68000s  on  a  high-speed  bus. 
No  straightforward  way  of  partitioning  the  algorithms  or  the 
images  has  proven  fully  effective  over  the  entire  class  of 
problems  considered. 


8.  Exploiting  the  Grinnell  Image  Processor 

We  have  upgraded  our  Grinnell  with  an  Image  Processor 
board,  whose  primitive  16-bit  ALU  imbedded  in  the  memory 
refresh-cycle  performs  as  an  extremely  fine-grained  SIMD 
machine.  After  writing  a  general  purpose  convolution 
package for it [1], we (Christian Fortunel and John Kender)
were  able  to  design  for  it  very  fast  parallel  algorithms  to  do 
edge  detection.  By  various  stratagems  we  have  reduced  the 
$O(\sigma^2)$ time for a series of Laplacian of Gaussian operators to $O(1)$,
with  very  little  loss  of  accuracy. 

Briefly,  the  two-dimensional  band-pass  operators  can  be 
decomposed  into  sums  or  differences  of  other  functions  in 
multiple ways; these component functions can further be
separated  by  variables  into  outer  products  of  one-dimensional 
functions.  Intermediate  results  can  be  calculated  so  that 
partial  computations  are  cascaded.  Further  time  reductions 
are  possible  by  expanding  the  operator  support.  Under 
proper  scaling  of  intermediate  results,  integer  arithmetic 
alone  suffices  and  expensive  floating  point  can  be  dispensed 
with altogether. Zero-crossings are localized with a
prairie-fire technique.
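
The separation by variables can be illustrated with a short Python
sketch. It shows only that the Laplacian of a Gaussian equals a sum
of two outer products of one-dimensional functions, so that a
two-dimensional filtering pass can be replaced by four one-dimensional
passes. It does not reproduce the cascading, operator expansion, or
integer scaling described above, and every function name is an
assumption made here.

```python
import math

def gauss(x, sigma):
    """Unnormalized 1-D Gaussian."""
    return math.exp(-x * x / (2.0 * sigma * sigma))

def gauss_dd(x, sigma):
    """Second derivative of the unnormalized 1-D Gaussian."""
    return (x * x - sigma * sigma) / sigma ** 4 * gauss(x, sigma)

def log_direct(x, y, sigma):
    """Laplacian of the unnormalized 2-D Gaussian, evaluated directly."""
    r2 = x * x + y * y
    return (r2 - 2.0 * sigma * sigma) / sigma ** 4 * math.exp(-r2 / (2.0 * sigma * sigma))

def log_separable(x, y, sigma):
    """The same operator written as a sum of two separable terms."""
    return gauss_dd(x, sigma) * gauss(y, sigma) + gauss(x, sigma) * gauss_dd(y, sigma)

def convolve_rows(image, taps):
    """Filter every row with a symmetric 1-D kernel; outside pixels count as zero."""
    r = len(taps) // 2
    width = len(image[0])
    return [[sum(taps[k + r] * row[j + k]
                 for k in range(-r, r + 1) if 0 <= j + k < width)
             for j in range(width)]
            for row in image]

def transpose(image):
    return [list(col) for col in zip(*image)]

def convolve_cols(image, taps):
    """Filter every column by transposing, filtering rows, and transposing back."""
    return transpose(convolve_rows(transpose(image), taps))

def log_filter(image, sigma, radius):
    """Apply the operator with four 1-D passes instead of one 2-D pass."""
    g = [gauss(k, sigma) for k in range(-radius, radius + 1)]
    gdd = [gauss_dd(k, sigma) for k in range(-radius, radius + 1)]
    a = convolve_cols(convolve_rows(image, gdd), g)
    b = convolve_cols(convolve_rows(image, g), gdd)
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
```

For a kernel of width w, the direct form needs w x w multiplications
per pixel while the separable form needs 4w, which is the kind of
saving the decomposition is meant to expose.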

The  above  code,  written  for  the  CMU  Grinnell 
environment  (Vax/Unix/C,  with  IU  testbed  standards),  has 
been  distributed  to  other  ARPA  sites  with  comparable 
equipment. 

The Grinnell Image Processor’s ability to instantaneously
pan over wide ranges of pixels has been exploited in another
related way: to implement a fine-grained message-passing
parallel shape-from-texture algorithm [10]. Although the
work required some skill in design and debugging, it was
surprising to be able to cast what was essentially a row-wise
sort into a simple SIMD procedure. The algorithm, which
generates vanishing lines based on texture density gradients,
was executed on a real image with good results.
Abstract 

The  SRI  Image  Understanding  program  is  a  broad  effort 
spanning  the  entire  range  of  machine  vision  research.  Its  three 
major  concerns  are:  (1)  to  develop  an  understanding  of  the 
physics  and  mathematics  of  the  vision  process,  (2)  to  develop  a 
knowledge-based framework for integrating and reasoning about
sensed  (imaged)  data,  and  (3)  to  develop  a  machine-based  en¬ 
vironment  for  effective  experimentation,  demonstration,  and 
evaluation  of  our  theoretical  results,  as  well  as  providing  a  vehicle 
for  technology  transfer.  This  report  describes  recent  progress  in 
all  three  areas.  In  particular,  we  have  shown  that  fractal  func¬ 
tions  are  an  effective  tool  for  representing  natural  shapes  and 
provide  a  good  basis  for  recovering  3-D  shape  from  the  shading 
and  texture  in  a  single  image.  For  scenes  containing  man-made 
objects,  we  have  found  ways  of  using  straight  edges  to  recover 
the  3-D  orientation  of  surfaces  from  a  single  view,  and  reason 
about  the  shape  of  an  object  from  partial  information  in  multiple 
views. We have built a new and powerful LISP-machine-based
environment  for  use  in  image  understanding  research,  and  are 
putting  together  high-performance  systems  for  stereo  compila¬ 
tion,  feature  extraction,  and  linear  delineation. 


1.  INTRODUCTION 

The  goal  of  this  research  program  is  to  obtain  solutions 
to  fundamental  problems  in  computer  vision,  particularly  to 
such  problems  as  stereo  compilation,  feature  extraction,  linear 
delineation,  and  general  scene  modeling  that  are  relevant  to  the 
development  of  an  automated  capability  for  interpreting  aerial 
imagery  and  the  production  of  cartographic  products. 

To  achieve  this  goal,  we  are  engaged  in  investigations  of  such 
basic  issues  as  image  matching,  partitioning,  representation,  and 
physical  modeling  (shape  from  shading,  texture,  and  optic  flow; 
material  identification;  recovery  of  imaging  and  illumination 
parameters such as “vanishing points,” “camera parameters,”
illumination source location; edge classification; etc.). However, it
is  obvious  that  high-level,  high-performance  vision  requires  the 
use  of  both  intelligence  and  stored  knowledge  (to  provide  an  in¬ 
tegrative  framework),  as  well  as  an  understanding  of  the  physics 
and  mathematics  of  the  imaging  process  (to  provide  the  basic 
information  needed  for  a  reasoned  interpretation  of  the  sensed 
data).  Thus,  a  significant  portion  of  our  work  is  devoted  to 
developing new approaches to the problem of “knowledge-based
vision.” Finally, vision research cannot proceed without a means
for effective implementation, demonstration, and experimental
verification  of  theoretical  concepts;  we  have  developed  an  en¬ 
vironment  in  which  some  of  the  newest  and  most  effective  com¬ 
puting  instruments  can  be  employed  for  these  purposes. 


2.  KNOWLEDGE  BASED  VISION: 
the  Construction  of  an  Expert  System 
Control  Structure  for  Stereo  Compilation 
and  Feature  Extraction. 

Our  intent  in  this  effort  is  to  develop  a  system  framework 
for  allowing  higher  level  knowledge  to  guide  and  integrate 
the  detailed  interpretation  of  imaged  data  by  autonomous 
scene  analysis  techniques.  Such  an  approach  allows  sym¬ 
bolic  knowledge,  provided  by  higher-level  knowledge  sources,  to 
automatically  control  the  selection  of  appropriate  algorithms, 
adjust their parameters, and apply them in the relevant portions
of the image. More significantly, we are attempting to
provide  an  efficient  means  for  supplying  and  using  qualitative 
knowledge  about  the  semantic  and  physical  structure  of  a  scene 
so  that  the  machine-produced  interpretation,  constrained  by  this 
knowledge,  will  be  consistent  with  what  is  generally  true  of  the 
overall  scene  structure,  rather  than  just  a  good  fit  to  locally 
applied  models. 

An  important  component  of  our  approach  is  to  design  a 
means  for  a  human  operator  to  simply  and  effectively  provide 
the  machine  with  a  qualitative  scene  description,  in  the  form  of 
a  semantically  labeled  3-D  “sketch.”  This  capability  for  effective 
communication  between  a  human  and  the  machine  about  the 
three-dimensional  world  requires  both  appropriate  graphics  tools 
and  an  ability  on  the  part  of  the  machine  for  both  spatial  reason¬ 
ing  and  some  semantic  “understanding.”  The  importance  of  this 
work  derives  from  the  fact  that  a  major  difficulty  in  automating 
the  image-interpretation  process  is  the  inability  of  current  com¬ 
puter  systems  to  deduce,  from  the  visible  image  content,  the 
general  context  of  the  scene  (e.g.,  urban  or  rural;  season  of  the 
year; what happened immediately before, and what will happen
immediately after, the image was viewed by the sensor). The
knowledge base and reasoning required for such an ability are well
beyond what the state of our art can hope to accomplish over (at
least) the next 6 years. Thus, our work is intended to provide
an interim means by which a human can supply a task-oriented
program with the high-level overview it needs for its analysis, but
cannot acquire by itself.

Two  of  our  major  integrative  efforts  are  directly  concerned 
with  the  knowledge-based  vision  problem: 
One  effort,  integrating  our  work  in  stereo  compilation  and  physi¬ 
cal  modeling,  is  the  construction  of  a  rule-based  system  with  a 
library  of  processes  and  activities,  which  can  be  invoked  to  carry 
out  specific  goals  in  the  domain  of  cartographic  analysis  and 
stereo  reconstruction.  This  work  is  based  on  results  described 
below,  but  the  integrative  framework  is  still  being  developed  and 
will  not  be  described  in  this  report. 

The  second  effort,  described  in  a  following  section  on  fea¬ 
ture  extraction,  is  a  restricted  version  of  the  concept  discussed 
above (it employs contextual and semantic knowledge, but does
not  address  the  issues  of  qualitative  reasoning  nor  3-D  spatial 
understanding). 


3.  DEVELOPMENT  OF  METHODS  FOR 
MODELING  AND  USING  PHYSICAL 
CONSTRAINTS  IN  IMAGE 
INTERPRETATION. 

Our  goal  in  this  work  is  to  develop  methods  that  will  first 
allow  us  to  produce  a  sketch  of  the  physical  nature  of  a  scene 
and  the  illumination  and  imaging  conditions,  and  then  permit 
us  to  use  this  physical  sketch  to  guide  and  constrain  the  more 
detailed descriptive processes, such as precise stereo mapping.

Our  approach  is  to  develop: 

•  models  of  the  relationship  between  physical  objects  in  the 
scene  and  the  intensity  patterns  they  produce  in  an  image  (e.g., 
models  that  allow  us  to  classify  intensity  edges  in  an  image  as 
either  shadow,  or  occlusion,  or  surface  intersection,  or  material 
boundaries  in  the  scene), 

•  models  of  the  geometric  constraints  induced  by  the  projective 
imaging process (e.g., models that allow us to determine the
location  and  orientation  of  the  camera  that  acquired  the  image, 
location of the vanishing points induced by the interaction
between scene and camera, location of a ground plane, etc.), and

•  models  of  the  illumination  and  intensity  transformations 
caused  by  the  atmosphere,  light  reflecting  from  scene  surfaces, 
and  the  film  and  digitization  processes  that  result  in  the  com¬ 
puter  representation  of  the  image. 

These  models,  when  instantiated  for  a  given  scene,  provide 
us with the desired “physical” sketch. We are assembling
a “constraint-based stereo system” that can use this physical
sketch  to  resolve  the  ambiguities  that  defeat  conventional  ap¬ 
proaches  to  stereo  modeling  of  scenes  (e.g.,  urban  scenes  or 
scenes  of  cultural  sites)  for  which  the  images  are  widely  separated 
in  either  space  or  time,  or  for  which  there  are  large  featureless 
areas,  or  a  significant  number  of  occlusions. 

A  summary  of  some  of  our  current  work  in  the  area  of 
modeling  and  using  physical  constraints  is  presented  below. 

Rectilinear Forms. Images of cultural scenes, such as building
complexes, typically contain a significant amount of linear
structure.  We  have  developed  an  effective  computational  tech¬ 
nique  for  recovering  3-D  interpretations  from  a  single  2-D  image 
in  many  such  cases.  It  works  by  finding  a  basis  for  a  vector 
space  suitable  for  quantifying  spatial  relations,  while  satisfying 
constraints imposed by linear features in the image. Given three
image lines that are assumed to be perspective projections of
mutually orthogonal scene features, the method backprojects the
lines into three-dimensional scene space, generating (potentially)
all possible combinations of line orientations. It selects the
combination that is “most orthogonal” by maximizing the triple
product  of  three  unit  basis  vectors,  using  the  method  of  steepest 
descent.  In  general,  two  solutions  are  found,  and  the  correct 
one  can  be  chosen  by  relating  the  solutions  to  knowledge  of  the 
vertical  direction.  A  more  complete  description  of  this  work  is 
presented in Barnard [1984a].
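
The “most orthogonal” criterion can be illustrated with a few lines
of Python. The sketch below only evaluates the triple product of three
normalized direction vectors; it does not perform the backprojection
or the steepest-descent search, and the function names, along with the
use of the absolute value, are assumptions made here.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def triple_product(a, b, c):
    """Scalar triple product a . (b x c) of three 3-vectors."""
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return sum(x * y for x, y in zip(a, cross))

def orthogonality(a, b, c):
    """Absolute triple product of the normalized vectors.

    The value is 1.0 when the three directions are mutually orthogonal,
    smaller when they are skewed, and 0.0 when they are coplanar.
    """
    return abs(triple_product(normalize(a), normalize(b), normalize(c)))
```

Because the triple product of unit vectors is the volume of the
parallelepiped they span, it reaches its maximum magnitude of one
exactly when the vectors are mutually orthogonal, which makes it a
convenient score for ranking candidate combinations of line
orientations.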

Inductive  Approach.  The  technique  discussed  above  has  led 
us  to  investigate  a  new  class  of  computational  methods  for  the 
interpretation  of  single  images.  These  methods  constitute  an 
inductive  approach  because  they  explicitly  recognize  that  the 
available data (the image) are insufficient to make a well-founded
logical  interpretation;  that  is,  many  interpretations  are  possible, 
but  only  one  is  preferred.  Specific  prior  models  cannot  account 
for  the  general  power  of  such  perception  in  the  case  of  a  human 
observer  (although  prior  models  are  certainly  used  when  avail¬ 
able).  To  be  truly  general  purpose,  machine  vision  must  be  able 
to  mimic  this  amazing  human  ability.  The  inductive  approach 
selects interpretations that are “simplest” in some sense. While
it  does  not  preclude  the  use  of  specific  prior  models,  it  emphasizes 
the  use  of  abstract  generic  models,  such  as  parametric  curves 
and  surfaces.  One  measure  of  simplicity  we  have  considered  is 
based  on  information-theoretic  considerations.  This  work  will 
be described in a report by Barnard [1984b].

Optic  Flow.  In  the  optic  flow  paradigm,  a  moving  observer 
is  normally  able  to  interpret  a  time  sequence  of  images  as  an 
implicit  description  of  a  static  scene.  In  principle,  the  images 
can  be  matched  point-by-point  and  the  motion  of  the  observer 
can  be  deduced  by  exploiting  the  constraint  that  the  scene  is 
fixed.  In  practice,  this  is  exceedingly  difficult  to  achieve,  both 
because  point  features  are  rare  and  because  the  methods  are  very 
sensitive  to  small  matching  errors. 

We  have  developed  an  alternative  optic-flow  method  that 
exploits  the  often  available  information  about  the  rotation  of  the 
observer.  Knowing  the  observer's  rotation  greatly  simplifies  the 
problem  of  matching  successive  images,  but,  since  all  the  useful 
information  that  can  be  derived  from  the  sequence  is  due  to 
the  translation  of  the  observer,  it  does  not  significantly  sacrifice 
generality.  The  major  advantage  to  translation-only  optic  flow 
is  that  curvilinear  image  features  can  be  matched  by  exploiting  a 
constraint  that  is  essentially  the  same  as  the  epipolar  constraint 
in  stereo  interpretation.  This  work  is  still  in  progress. 
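
The resemblance to the epipolar constraint can be made explicit with
a short derivation. It is a sketch under assumptions not stated in the
text: a pinhole camera with focal length $f$, a static scene point
$(X, Y, Z)$ projecting to $(x, y)$, and an observer translating with
velocity $(T_x, T_y, T_z)$, $T_z \neq 0$, after its rotation has been
removed. All symbols are introduced here for illustration.

```latex
\begin{aligned}
x &= f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}, \\
\dot{x} &= \frac{x\,T_z - f\,T_x}{Z}, \qquad
\dot{y} = \frac{y\,T_z - f\,T_y}{Z}, \\
(\dot{x}, \dot{y}) &= \frac{T_z}{Z}\,\bigl(x - x_0,\; y - y_0\bigr),
\qquad (x_0, y_0) = \Bigl(f\,\frac{T_x}{T_z},\; f\,\frac{T_y}{T_z}\Bigr).
\end{aligned}
```

Every flow vector therefore lies on the line joining its image point
to the fixed point $(x_0, y_0)$, so a feature need only be sought
along that line, much as a stereo match is sought along an epipolar
line.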

Spatial Reasoning from Line Drawings of Polyhedra.
Construction of a three-dimensional “sketch” is one task faced
by  the  user  of  an  interactive  image  understanding  expert  system. 
An  urban  scene  typically  contains  buildings  and  other  objects 
that can be modeled as planar-faced polyhedra. An effective way
for  the  user  to  create  3-D  sketches  from  multiple  views  of  such 
objects  has  been  devised. 

The  system  requires  two  or  more  line  drawings  of  a 
polyhedral  scene  from  arbitrary  vantage  points.  These  line  draw¬ 
ings  may  be  obtained  from  a  freehand  sketch,  by  tracing  the 
edges  in  several  photos,  or  from  the  output  of  an  automatic  edge 
detector. A “wireframe” model of the objects is obtained by
back-projecting  the  line  drawings.  Labels  of  solid  or  vacant  space 
are  then  assigned  to  all  spatial  regions  defined  by  the  wireframe 
using  an  iterative  constraint  propagation  algorithm.  The  result 
is  a  data  structure  that  captures  the  volumetric  structure  of  the 
objects  depicted,  which  can  then  be  used  to  support  hidden-line 
elimination  and  other  volumetric  operations  upon  the  object. 
This  work  is  described  in  Strat  [1984a]. 

Determining  The  Imaging  Geometry  from  a  Camera 
Transformation Matrix. Many scene analysis algorithms
require  knowledge  of  the  geometry  of  the  image  formation 
process  as  a  prerequisite  to  their  application.  When  the  imag¬ 
ing  situation  can  be  controlled  or  measured  directly,  the  needed 
parameters  can  be  determined;  however,  in  the  case  of  un¬ 
calibrated  images,  or  photographs  whose  history  is  unknown, 
the  necessary  parameters  are  not  available.  In  these  cases,  an 
alternate  method  of  inferring  the  imaging  situation  from  the 
correspondence  between  a  small  set  of  image  and  object  points 
is  required. 

One  approach  has  been  to  compute  the  imaging  geometry 
directly  from  the  constraints  provided  by  the  known  data  points. 
Partial  information  such  as  the  camera's  focal  length  or  the  loca¬ 
tion  of  the  piercing  point  in  the  image  can  be  used  to  reduce  the 
number  of  data  points  needed.  A  second  approach  consists  of 
two  steps.  First,  the  known  data  points  are  used  to  compute 
a  4x4  homogeneous  coordinate  transformation  matrix  that  cap¬ 
tures  the  entire  transformation  from  object  point  to  image  point. 
An  established  technique  for  this  computation  involves  the  least 
squares  solution  of  a  set  of  simultaneous  linear  equations  from 
six  or  more  known  correspondences.  The  goal  of  the  second 
step  is  to  derive  the  various  parameters  of  the  image  formation 
process  from  the  transformation  matrix.  This  problem  can  be 
posed as a system of nonlinear equations whose solution had
previously required iterative methods. Recently, Ganapathy [1984] published
the  first  noniterative  solution. 
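
The linear system in the first step can be written out as follows.
This is a sketch of a commonly used formulation, not necessarily the
exact one intended here: $m_{ij}$ denotes the entries of the three
rows of the transformation matrix that determine the image
coordinates, $(X_k, Y_k, Z_k)$ is a known object point, and
$(u_k, v_k)$ is its image. All symbols are introduced here for
illustration.

```latex
\begin{aligned}
u_k &= \frac{m_{11}X_k + m_{12}Y_k + m_{13}Z_k + m_{14}}
            {m_{31}X_k + m_{32}Y_k + m_{33}Z_k + m_{34}}, \qquad
v_k = \frac{m_{21}X_k + m_{22}Y_k + m_{23}Z_k + m_{24}}
           {m_{31}X_k + m_{32}Y_k + m_{33}Z_k + m_{34}}, \\[4pt]
0 &= m_{11}X_k + m_{12}Y_k + m_{13}Z_k + m_{14}
     - u_k\,(m_{31}X_k + m_{32}Y_k + m_{33}Z_k + m_{34}), \\
0 &= m_{21}X_k + m_{22}Y_k + m_{23}Z_k + m_{24}
     - v_k\,(m_{31}X_k + m_{32}Y_k + m_{33}Z_k + m_{34}).
\end{aligned}
```

Each correspondence contributes two equations that are linear in the
twelve entries, which are defined only up to a common scale, leaving
eleven unknowns. Six correspondences give twelve equations, which is
why six or more known points permit a least squares solution.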

Research  performed  at  SRI  has  also  produced  a  noniterative 
solution  (Strat  [1984b]).  By  reasoning  about  the  geometric  con¬ 
straints  inherent  in  a  camera  transformation  matrix,  a  simple, 
easily  understood  method  of  determining  the  various  parameters 
is  obtained.  Through  a  series  of  geometric  constructions,  the 
camera's  location  and  orientation,  along  with  the  piercing  point 
and  the  relations  between  the  focal  length  and  scale  factors,  can 
be  determined.  The  method  relies  purely  on  spatial  reasoning 
about  geometric  constraints  and  does  not  involve  an  intuitively 
opaque  matrix  decomposition.  Furthermore,  its  sensitivity  to  er¬ 
rors  can  be  studied  geometrically,  allowing  a  clear  understanding 
of  the  conditions  that  lead  to  inaccurate  decompositions.  The 
technique  has  been  successfully  applied  to  both  synthetic  imag¬ 
ing  situations  and  real  photographs. 

4.  STEREO  COMPILATION:  IMAGE 
MATCHING  AND  INTERPOLATION 

We  are  implementing  a  state-of-the-art  stereo  system  that 
produces  dense  range  images  given  pairs  of  intensity  images.  We 
plan  to  use  it  both  as  a  framework  for  our  stereo  research,  and 
as the base component of an expert system concerned with 3-D
compilation.

There  are  five  components  of  this  stereo  system:  a  rectifier, 
a  sparse  matcher,  a  dense  matcher,  an  interpolater,  and  a 
projective  display  module.  The  rectifier  accepts  estimates  of 
the  parameters  and  distortions  associated  with  the  imaging 
process,  the  photographic  process,  and  the  digitization.  These 
parameters  are  used  to  map  digitized  image  coordinates  onto 
an  ideal  image  plane.  The  sparse  matcher  performs  two- 
dimensional  searches  to  find  several  matching  points  in  the  two 
images,  which  it  uses  to  compute  a  relative  camera  model.  The 
dense  matcher  tries  to  match  as  many  points  as  possible  in  the 
two  images.  It  uses  the  relative  camera  model  to  constrain  the 
searches  to  one  dimension,  along  epipolar  lines.  The  interpolater 
computes  a  grid  of  range  values  by  interpolating  between  the 
matches  found  by  either  the  sparse  or  the  dense  matcher.  The 
projective  display  module  allows  interactive  examination  of  the 
computed  3-D  model  by  generating  2-D  projective  views  of  the 
model  from  arbitrarily  selected  locations  in  space. 
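
The one-dimensional search performed by the dense matcher can be
sketched in Python as follows. The sketch assumes that rectification
has made the epipolar lines coincide with image rows, and it uses a
sum of squared differences as the match score; both choices, and the
function name, are assumptions made here rather than details of the
system described in this report.

```python
def match_along_scanline(left_row, right_row, x, half_window, max_disparity):
    """Find the disparity of the pixel at column x of the left row.

    A window centred at x in the left row is compared with windows
    centred at x - d in the right row for d = 0 .. max_disparity, and
    the d with the smallest sum of squared differences is returned.
    Returns None if no candidate window fits inside both rows.
    """
    best_d, best_cost = None, None
    lo, hi = x - half_window, x + half_window
    for d in range(max_disparity + 1):
        if lo - d < 0 or hi >= len(left_row) or hi - d >= len(right_row):
            continue  # the window would fall off one of the rows
        cost = sum((left_row[i] - right_row[i - d]) ** 2
                   for i in range(lo, hi + 1))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

Restricting the search to a single row is what the relative camera
model buys: the sparse matcher must search in two dimensions, whereas
each dense match examines at most `max_disparity + 1` candidates.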

The current system, which runs on the VAX-11/780 in C,
is  described  in  Hannah  [1984].  At  present,  the  system  produces 
relatively  sparse  3-D  information,  even  in  its  dense  matching 
mode.  Often  3-D  data  are  required  that  are  more  closely  spaced 
than  can  be  provided  by  the  stereo  matching  process.  Further, 
there  may  be  areas  of  the  images  that  cannot  be  matched  due  to 
noise,  insufficient  information,  and  occlusions;  this  will  produce 
holes  in  the  dense  matched  data  that  must  be  filled  in.  In  either 
case,  interpolation  is  necessary  to  provide  3-D  data  between 
matched  points. 

Interpolation.  We  are  currently  exploring  two  different 
schemes  for  interpolation.  One  is  a  global  approach,  in  which  all 
of  the  3-D  information  available  is  used  to  find  the  interpolated 
value  for  a  given  point.  (This  approach  is  described  in  Smith 
[1984].)  The  second  approach  is  a  local  one,  which  only  uses 
the  data  in  the  neighborhood  of  the  point  to  be  interpolated. 
The  global  approach  produces  a  functional  description  that  can 
be  differentiated  analytically  to  determine  slope  and  other  sur¬ 
face  attributes;  the  local  approach  is  most  useful  in  the  context 
of  verifying  the  plausibility  of  the  matches  by  comparing  the 
data  from  the  stereo  images  after  projection  onto  this  surface. 
The  local  approach  is  being  used  in  the  context  of  a  hierarchical 
matching  scheme  described  below. 

Matching.  In  a  parallel  research  effort  employing  our  Lisp 
Machines, we are exploring a hierarchical technique for developing
a regular, dense grid of matched points. This technique
does  appropriate  warping  of  the  images  between  each  level  of  the 
hierarchy,  to  account  for  differences  in  perspective  between  the 
two  images  as  predicted  by  the  model.  As  a  part  of  this  effort, 
local  interpolation  techniques  have  been  developed  to  fill  in  holes 
in  the  model  before  proceeding  from  level  to  level. 
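
A minimal Python sketch of local hole filling is given below. It
assumes that unmatched grid cells are marked with `None` and that a
hole is filled with the mean of its valid four-connected neighbours;
the actual local interpolation technique is not specified in this
report, so the scheme and the function name are assumptions made here.

```python
def fill_holes(grid):
    """Fill None cells with the mean of their valid 4-neighbours.

    Passes are repeated so that large holes are filled from their
    borders inward. The input grid is left unchanged and a new grid is
    returned. Cells stay None only if the grid has no valid data.
    """
    rows, cols = len(grid), len(grid[0])
    result = [list(row) for row in grid]
    changed = True
    while changed:
        changed = False
        updates = {}
        for r in range(rows):
            for c in range(cols):
                if result[r][c] is not None:
                    continue
                neighbours = [result[rr][cc]
                              for rr, cc in ((r - 1, c), (r + 1, c),
                                             (r, c - 1), (r, c + 1))
                              if 0 <= rr < rows and 0 <= cc < cols
                              and result[rr][cc] is not None]
                if neighbours:
                    updates[(r, c)] = sum(neighbours) / len(neighbours)
        # Apply all updates after the scan so one pass uses only old values.
        for (r, c), value in updates.items():
            result[r][c] = value
            changed = True
    return result
```

Filling the range grid at one level of the hierarchy in this way gives
the next, finer level a complete surface estimate from which to
predict the warping between the two images.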

The  Lisp  Machine  implementation  includes  a  sophisticated 
terrain  display  package,  which  permits  the  user  to  interactively 
designate  a  flight  path  through  the  3-dimensional  model  derived 
from a pair of images. The system then creates a “movie” (a
sequence of either monocular or stereo views) of the terrain as
the user “flies” along the path above the terrain. This package
is useful not only for assessing the quality of the derived model,
but also for tasks in which a prediction of the appearance of
the scene from arbitrarily specified points of view is desired, as
when an observer is moving through mapped terrain. This work
is described in Quam [1984].

Evaluation. We now have available, on our VAX (Testbed)
and Lisp Machines, some of the most advanced stereo matching
systems developed by the IU community. As a part of our
stereo research effort, we plan to run several calibrated data
sets through these systems to determine the relative strengths
and weaknesses of the various methods, including area correlation,
hierarchical warped matching, edge matching, and
edge/intensity matching.

5.  THE  REPRESENTATION  OF 
NATURAL  SCENES 

Our  current  research  in  this  area  addresses  two  related  prob¬ 
lems:  (1)  representing  natural  shapes  such  as  mountains,  vegeta¬ 
tion,  and  clouds,  and  (2)  computing  such  descriptions  from  image 
data.  The  first  step  towards  solving  these  problems  is  to  obtain 
a  model  of  natural  surface  shapes. 

A  model  of  natural  surfaces  is  extremely  important  because 
we  face  problems  that  seem  impossible  to  address  with  stan¬ 
dard  descriptive  computer  vision  techniques.  How,  for  instance, 
should  we  describe  the  shape  of  leaves  on  a  tree?  Or  grass?  Or 
clouds?  When  we  attempt  to  describe  such  common,  natural 
shapes  using  standard  representations,  the  result  is  an  unrealis¬ 
tically  complicated  model  of  something  that,  viewed  introspec- 
tively,  seems  very  simple.  Furthermore,  how  can  we  extract  3-D 
information  from  the  image  of  a  textured  surface  when  we  have 
no  models  that  describe  natural  surfaces  and  how  they  evidence 
themselves  in  the  image?  The  lack  of  such  a  3-D  model  has 
restricted image texture descriptions to being ad hoc statistical
measures  of  the  image  intensity  surface. 

Fractal  functions,  a  novel  class  of  naturally  arising  func¬ 
tions,  are  a  good  choice  for  modeling  natural  surfaces,  because 
many basic physical processes (e.g., erosion and aggregation)
produce  a  fractal  surface  shape  and  because  fractals  are  widely 
used  as  a  graphics  tool  for  generating  natural-looking  shapes. 
Additionally,  in  a  survey  of  natural  imagery,  we  found  that 
a  fractal  model  of  imaged  3-D  surfaces  furnishes  an  accurate 
description  of  both  textured  and  shaded  image  regions,  thus 
providing  validation  of  this  physics-derived  model  for  both  image 
texture  and  shading. 

Progress  relevant  to  computing  3-D  information  from 
imaged  data  has  been  achieved  by  use  of  the  fractal  model.  A 
test  has  been  derived  to  determine  whether  or  not  the  fractal 
model is valid for a particular set of image data, an empirical
method for computing surface roughness from image data
has been developed, and substantial progress has been made
in the areas of shape-from-texture and texture segmentation.
Characterization of image texture by means of a fractal surface
model  has  also  shed  considerable  light  on  the  physical  basis  for 
several  of  the  texture-partitioning  techniques  currently  in  use 
and  made  it  possible  to  describe  image  texture  in  a  manner  that 
is  stable  over  transformations  of  scale  and  linear  transforms  of 


intensity. 
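
As an illustration of how a roughness measure might be computed from image data, the sketch below estimates a scaling exponent H from a 1-D intensity profile. It assumes that the mean absolute intensity difference grows as a power of the sample separation. This is a generic scaling estimate and is not claimed to be the empirical method of Pentland [1983 and 1984]; the function names and the choice of lags are hypothetical.

```python
import math
import random

def estimate_hurst(profile, lags=(1, 2, 4, 8, 16)):
    """Estimate the scaling exponent H of a 1-D intensity profile.

    Assumes E|I(x + d) - I(x)| is proportional to d**H, so H is the slope
    of log(mean absolute difference) against log(d).
    """
    xs, ys = [], []
    for d in lags:
        diffs = [abs(profile[i + d] - profile[i]) for i in range(len(profile) - d)]
        mean_diff = sum(diffs) / len(diffs)
        if mean_diff > 0:
            xs.append(math.log(d))
            ys.append(math.log(mean_diff))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Least-squares slope of the log-log plot.
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def random_walk(n, seed=0):
    """Synthetic test profile: cumulative sum of Gaussian steps."""
    rng = random.Random(seed)
    value, out = 0.0, []
    for _ in range(n):
        value += rng.gauss(0.0, 1.0)
        out.append(value)
    return out
```

Under the stated assumption a smaller H corresponds to a rougher profile: a straight ramp gives H = 1, while a simulated random walk gives H near 0.5.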

The  computation  of  a  3-D  fractal-based  representation  from 
actual  image  data  has  been  demonstrated.  This  work  has  shown 
the  potential  of  a  fractal-based  representation  for  efficiently  com¬ 
puting  good  3-D  representations  for  a  variety  of  natural  shapes, 
including  such  seemingly  difficult  cases  as  mountains,  vegetation, 
and  clouds. 

This  research  is  expected  to  contribute  to  the  development 
of (1) a computational theory of vision applicable to natural
surface shapes, (2) compact representations of shape useful for
natural surfaces, and (3) real-time regeneration and display of
natural scenes. We also anticipate adding significantly to our
understanding  of  the  way  humans  perceive  natural  scenes. 

Details of this work can be found in Pentland [1983 and
1984].

6. FEATURE EXTRACTION: SCENE
PARTITIONING,  AND  LABELING 

Our efforts in image partitioning and labeling have advanced
along two fronts: we have developed a goal-directed texture-based
segmentation algorithm and have studied knowledge-based
control  concepts  required  to  integrate  this  with  other  image 
feature-extraction  techniques. 

The SLICE goal-directed segmentation system combines
knowledge of target textures or signatures with knowledge of
background  textures  by  using  histogram-similarity  transforms. 
Regions  of  high  similarity  to  a  target  texture  and  of  low 
similarity  to  any  negative  texture  examples  are  found.  This 
use  of  semantic  knowledge  during  the  segmentation  process  im¬ 
proves  segmenter  performance  and  focuses  segmentation  activity 
on  material  types  of  greatest  interest.  (The  system  can  also  be 
used  for  goal-independent  texture  segmentation  by  omitting  the 
similarity-transform computations.) Development of this segmentation
technique is essentially complete; all that remains is
to  integrate  it  into  the  more  general  feature-extraction  system 
described  below.  Performance  of  the  SLICE  segmentation  algo¬ 
rithm  is  documented  in  Laws  [1984]. 
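
The report does not define the histogram-similarity transform, so the following is only a sketch of the general idea under stated assumptions: gray-level histograms are compared by histogram intersection, and a window scores highly when it resembles the target texture and none of the negative examples. The function names, the bin count, and the scoring rule are all hypothetical and are not taken from Laws [1984].

```python
def histogram(values, bins=8, lo=0, hi=256):
    """Normalized histogram of gray levels in [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(bins - 1, int((v - lo) / width))] += 1
    total = float(len(values))
    return [c / total for c in counts]

def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def goal_score(window, target_hist, negative_hists):
    """High when the window resembles the target and none of the negatives."""
    h = histogram(window)
    worst_negative = max((similarity(h, n) for n in negative_hists), default=0.0)
    return similarity(h, target_hist) - worst_negative

# Example: a bright target texture and a dark background texture.
target = histogram([200] * 50 + [220] * 50)
background = histogram([10] * 100)
bright_score = goal_score([210] * 20, target, [background])
dark_score = goal_score([5] * 20, target, [background])
```

Passing an empty list of negative examples reduces the score to plain similarity with the target.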

The KNIFE (knowledge-based interactive feature-extraction)
system is intended to solve problems in image segmentation,
feature extraction, material identification, and feature
classification. (Image segmentation and feature extraction partition
an image into meaningful units; material identification and
feature classification label those units.) Experience has shown
that these tasks cannot be carried out adequately in isolation.
Image  segmentation,  for  instance,  cannot  produce  a  meaningful 
partitioning  unless  it  is  guided  by  semantic  criteria  from  material 
identification  and  feature  classification. 

The  KNIFE  feature-extraction  system  will  combine  a  data 
base  of  recognition  rules  (using  shape,  texture,  and  context) 
with  recursive  segmentation  and  other  techniques  to  find  and 
label scene features. Initially selected image regions, based on
image brightness and texture, are resegmented and refined to
locate recognizable objects (e.g., roads, fields, and buildings).
The control process assigns initial labels for each region, and
then recursively analyzes those regions that might contain useful
substructure.  The  choice  of  regions  to  split  or  merge  is  influenced 


by  analysis  goals  rather  than  solely  by  statistical  properties  of 
the  image  data.  The  segmentation  and  interpretation  will  thus 
proceed  at  unequal  rates  or  to  different  depths  in  separate  scene 
regions,  with  differing  types  of  knowledge  applied  at  successive 
stages  in  the  analysis.  Objects  detected  by  other  means  (user 
interaction  or  direct  object  recognition)  may  override  the  normal 
interpretation  cycle. 

We  are  concentrating  our  development  efforts  on  goal- 
directed  recursive  segmentation  and  on  related  display,  query, 
and  editing  tools.  Among  these  tools  are  display  of  input 
images  and  segmentation  maps;  readout  of  region  descriptions 
and  relationships;  and  commands  for  interactively  designating, 
splitting,  merging,  and  classifying  regions. 

The  control  process  is  a  production  system  that  looks  for 
applicable  rules  in  the  rule  base.  Such  rules  will  be  placed  on 
a  prioritized  queue  of  tasks  to  be  performed.  When  executed, 
they  may  query  the  user,  invoke  image  analysis  subsystems,  or 
affect  the  behavior  of  the  control  process  itself. 
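
The control loop described above can be sketched as an agenda-driven production system. This is a toy illustration only: the rule format, the set used as a blackboard, and the convention that a lower number means higher priority are assumptions made here, not details of the KNIFE design.

```python
import heapq
import itertools

class ControlProcess:
    """Toy agenda-driven production system (illustrative only)."""

    def __init__(self, rules):
        self.rules = rules      # list of (priority, name, condition, action)
        self.agenda = []        # heap of (priority, tie_breaker, name, action)
        self.fired = set()
        self._counter = itertools.count()

    def match(self, blackboard):
        # Queue every applicable rule that has not fired and is not already queued.
        queued = {name for _, _, name, _ in self.agenda}
        for priority, name, condition, action in self.rules:
            if name not in self.fired and name not in queued and condition(blackboard):
                heapq.heappush(self.agenda, (priority, next(self._counter), name, action))

    def run(self, blackboard):
        trace = []
        self.match(blackboard)
        while self.agenda:
            _, _, name, action = heapq.heappop(self.agenda)
            action(blackboard)       # an action may add facts to the blackboard
            self.fired.add(name)
            trace.append(name)
            self.match(blackboard)   # new facts may enable further rules
        return trace

# Hypothetical rules; a lower number means a higher priority.
rules = [
    (2, "label-road", lambda bb: "elongated" in bb, lambda bb: bb.add("road")),
    (1, "segment", lambda bb: "image" in bb, lambda bb: bb.add("elongated")),
    (3, "ask-user", lambda bb: "road" in bb, lambda bb: bb.add("confirmed")),
]
blackboard = {"image"}
trace = ControlProcess(rules).run(blackboard)
```

Each rule fires at most once in this sketch; a real system would need a policy for re-triggering rules when the facts they depend on change.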

Besides the rule base and the input or derived imagery, the
system  will  have  two  principal  data  structures.  These  are  the 
sketch  data  base,  and  the  prototype  data  base. 

The  sketch  data  base  serves  as  the  system  blackboard,  stor¬ 
ing  all  the  information  relevant  to  the  current  image.  The 
prototype  data  base  will  be  a  semantic  network  with  nodes  stor¬ 
ing  object  properties  and  pointers  to  image  examples. 

The system is being developed on the VAX-based SRI Image
Understanding (IU) Testbed. The basis for the system's data
analysis capabilities will be the body of software currently accumulated
in the testbed and other programs now being developed,
such as the SLICE goal-directed segmentation system discussed
above.


7.  LINEAR  DELINEATION  AND 
PARTITIONING 

A  basic  problem  in  machine-vision  research  is  how  to 
produce  a  line  sketch  that  adequately  captures  the  semantic  in¬ 
formation  present  in  an  image.  (For  example,  maps  are  stylized 
line  sketches  that  depict  restricted  types  of  scene  information.) 
Before  we  can  hope  to  attack  the  problem  of  semantic  inter¬ 
pretation,  we  must  solve  some  open  problems  concerned  with 
direct  perception  of  line-like  structure  in  an  image,  and  with 
decomposing  complex  networks  of  line-like  structures  into  their 
primitive  (coherent)  components.  Both  of  these  problems  have 
important practical as well as theoretical implications.

For  example,  the  roads,  rivers,  and  rail-lines  in  aerial  images 
have  a  line-like  appearance.  Methods  for  detecting  such  struc¬ 
tures  must  be  general  enough  to  deal  with  the  wide  variety  of 
shapes  they  can  assume  in  an  image  as  they  traverse  natural 
terrain.

Most approaches to object recognition depend on using the
information encoded in the geometric shape of the contours of
the objects. When objects occlude or touch one another, decomposition
of the merged contours is a critical step in interpretation.

We  have  made  significant  progress  in  both  the  delineation 
and the partitioning problems. Our work in delineation (Fischler


and  Wolf  [1983])  is  based  on  the  discovery  of  a  new  perceptual 
primitive  that  is  highly  effective  in  locating  line-like  (as  opposed 
to  edge-like)  structure. 

One  approach  to  decomposing  linear  structures  into 
coherent  components  (Fischler  and  Bolles  [1983])  is  based  on  the 
concept that perception is an explanatory process: acceptable
percepts must be associated with explanations that are believable.
They must be complete (i.e., they explain all the data),
simple (i.e., both concise and of limited complexity), and stable
(i.e., they must not change under small perturbations of either
the imaging conditions or the decision algorithm parameters).

A  second  approach  to  the  partitioning  problem,  which 
also  addresses  the  problem  of  qualitative  matching  of  linear 
structures  (Smith  and  Wolf  [1984]),  focuses  on  the  concept  of 
simplicity  as  the  basis  for  making  perceptual  decisions.  Given 
a  set  of  primitives  as  the  basis  for  description,  each  possible 
description  of  a  set  of  data  is  evaluated  as  to  how  accurately 
it  describes  the  data  and  how  “long”  a  description  is  required 
(a  natural  conversion  from  accuracy  to  descriptive  length  is 
provided).  The  shortest  description  is  chosen  as  being  correct. 
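
The selection rule above can be illustrated with a minimal sketch. The report does not state its conversion from accuracy to descriptive length, so the one used here (a fixed cost per primitive plus a logarithmic cost per residual) is an assumption, as are the piecewise-constant primitives, the constants, and the function names.

```python
import math

MODEL_BITS = 16.0   # assumed cost of stating one primitive (hypothetical value)
PRECISION = 0.5     # assumed measurement precision (hypothetical value)

def description_length(data, breaks):
    """Bits needed to describe `data` as constant segments split at `breaks`."""
    bounds = [0] + list(breaks) + [len(data)]
    total = 0.0
    for start, end in zip(bounds, bounds[1:]):
        segment = data[start:end]
        mean = sum(segment) / len(segment)
        total += MODEL_BITS
        # Residuals cost more bits the further the data lie from the primitive.
        total += sum(math.log2(1.0 + abs(v - mean) / PRECISION) for v in segment)
    return total

def shortest_description(data, candidates):
    """Return the candidate partition with the smallest total description length."""
    return min(candidates, key=lambda breaks: description_length(data, breaks))

step_data = [0.0] * 10 + [10.0] * 10
flat_data = [3.0] * 20
candidates = [(), (10,), (5, 10, 15)]
best_for_step = shortest_description(step_data, candidates)
best_for_flat = shortest_description(flat_data, candidates)
```

With these assumed costs, one break is worth paying for on the step data, while the flat data is best described by a single primitive; extra breaks that buy no accuracy only lengthen the description.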

These  new  delineation  and  partitioning  algorithms  have 
produced  excellent  results  in  experimental  tests  on  real  data. 
Our  continuing  work  in  this  area  focuses  on  theoretical,  as  well 
as  performance,  issues. 

8.  COMPUTING  ENVIRONMENT  FOR 
IU  RESEARCH 

Previous reports (e.g., Hanson and Fischler [1981]) describe
the VAX 11/780 testbed environment we created for evaluation,
demonstration, and transfer of IU technology. A
significant recent addition to this system is based on the
Symbolics 3600 LISP machine. Documentation of this new system
is still incomplete, but as noted in section four of this
report (Stereo Compilation: Matching), applications recently
considered beyond the state-of-the-art on comparably priced
hardware have already been programmed and demonstrated
(Quam [1984]).

The gap between the image arrays and the high level descriptions that
are  required  for  many  visual  tasks  is  too  wide  to  be  bridged  in  a  single 
step.  Intermediate  steps  are  necessary,  leading  to  several  representations 
of  the  visible  world.  Our  work  to  date  has  focused  primarily  on  the 
initial representations of low-level vision up to the 2.5-D sketch that
encodes  information  about  the  3-D  surfaces  and  their  properties.  We 
are  now  turning  our  efforts  to  the  integration  of  different  sources  of 
information  and  to  various  aspects  of  the  general  problem  of  deriving 
a  powerful  symbolic  representation  of  the  world  from  image  intensities. 
In this report we review our recent work on early vision, starting from
the perspective of regularization analysis, which is providing a new and
powerful theoretical framework for most of early vision. In particular,
we will discuss our progress in edge detection, multiple scale methods,
computation of motion, stereo algorithms, multigrid algorithms and
surface reconstruction across discontinuities. We also describe some of
our work on higher level vision, including shape representation, object
recognition and the analysis of spatial relations among objects and
object parts.


1.  Regularization:  the  New  Approach 

Until  recently,  researchers  in  vision  had  little  common  theoretical 
framework  to  call  upon.  It  is  true  that  significant  progress  has  been 
made  in  recent  years  toward  solving  some  problems  of  early  vision 
and implementing those solutions in working algorithms. But the
methods,  the  tools,  and  the  techniques  were  specific  to  each  problem 
and  had  to  be  invented  fresh  on  each  occasion. 

A  recent  theoretical  development  promises  to  improve  on  this 
situation.  We  now  believe  that  most  of  the  early  vision  problems 
can be "solved" using regularization analysis. This new approach
leads to a specific, powerful class of algorithms for most problems in
vision, and to parallel, fine-grained hardware with local interconnections
for implementing these algorithms efficiently (Poggio and Torre, 1984;
Poggio and Koch, 1984). As we will see, regularization analysis is far
from being a one-shot solution to early vision. A physical analysis
of any specific problem and of its generic constraints is critically
important. Examples of domains of analysis that are required and
can be exploited in regularization analysis are the physics of image
formation and multiresolution analysis of images. In addition, any
specific module of early vision requires an analysis of its specific
physical constraints.

1.1.  Ill-posed  Problems  and  Regularization  Analysis 

In 1923, Hadamard defined a mathematical problem to be ill-posed
when its solution

• Does not exist.

• Is not unique.

• Does not depend continuously on the initial data, or said another
way, the solution is not robust in the face of noise.

Most early vision problems are ill-posed in Hadamard's sense. There
are three reasons for this. First, most early vision problems have no
unique solution. Second, their solutions do not depend continuously
on the data. And third, most early vision problems are inverse
problems, and we know that most inverse problems are ill-posed.
We have formally shown that several low-level vision problems, such
as edge detection, motion measurement, stereo matching and surface
interpolation, are ill-posed in Hadamard's sense (Poggio and Torre,
1984).

Our  optimism  about  the  prospect  of  solving  much  of  early  vision  flows 
from  recent  advances  that  American  and  Russian  mathematicians  have 
made  in  developing  rigorous  regularization  methods  for  “solving”  ill- 
posed  problems.  The  basic  idea  behind  these  regularization  techniques 
is  to  restrict  the  space  of  acceptable  solutions  by  choosing  a  function 
that  minimizes  an  appropriate  functional.  The  mathematics  involved 
in  regularizing  ill-posed  problems  leads  to  choices  that  depend 
fundamentally  on  a  physical  analysis  of  the  generic  constraints  on 
the problem. It has long been recognized that the identification of
appropriate  physical  constraints  is  a  necessary  prerequisite  to  the 
formulation  of  early  vision  problems  in  a  way  that  is  well-defined  and 
soluble.  In  fact,  some  vision  problems  such  as  shape  from  shading, 
surface  interpolation  and  motion  measurement  have  previously  been 
formulated  precisely  in  the  form  required  by  standard  regularization 
methods.  This  common  theoretical  framework  allows  us  to  apply 
these  rigorous  methods  to  many  other  ill-posed  problems  in  vision. 

1.2.  Standard  Regularization  Techniques  for  Early  Vision 

Early  vision  can  best  be  defined  as  inverse  optics.  Its  main  goal  can 
be  viewed  as  the  solution  to  inverse  problems.  For  example,  in  many 
problems, one seeks a solution $z$, given data $y$, such that $Az = y$. To
apply regularization methods, one must first choose a set of norms $\|\cdot\|$
(usually quadratic) and a stabilizing functional $\|Pz\|$. The problem is
then restated as the following variational problem: find a solution $z$
such that functional (1) is minimized,

$$\|Az - y\|^2 + \lambda \|Pz\|^2 \qquad (1)$$

The  first  term  captures  the  closeness  of  the  solution  to  the  data.  The 
second  term  captures  the  degree  of  regularization  of  the  solution, 
and  generally  embodies  the  additional  physical  constraints  necessary 
to solve the problem. The regularization parameter $\lambda$ controls the
compromise between these two factors. The regularization techniques
that we are presently extending to early vision provide ways to
determine the best $\lambda$. They also provide results about the form of
the stabilizing functional $P$ that ensure uniqueness of the result and
rapid convergence of the computation.
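
To make the computational consequence of functional (1) concrete, consider a finite-dimensional discretization in which $A$ and $P$ are matrices and both norms are Euclidean. This setting is an assumption made here for illustration; the text does not fix it. Setting the gradient of the functional to zero then gives a linear system:

```latex
\nabla_z \left( \|Az - y\|^2 + \lambda \|Pz\|^2 \right)
  = 2A^{T}(Az - y) + 2\lambda P^{T}P z = 0
\quad\Longrightarrow\quad
\left( A^{T}A + \lambda P^{T}P \right) z = A^{T}y .
```

For $\lambda > 0$ the matrix on the left is positive definite, and the solution therefore unique, whenever the null spaces of $A$ and $P$ have only the zero vector in common. This is one way to see how the stabilizing term restores uniqueness when $Az = y$ alone does not determine $z$.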


We have recently shown that our work on the problems of shape
from shading, computation of motion, and surface reconstruction
can be reformulated as instances of the main regularization method.
We will discuss some of these problems in more detail in the next
sections. We have also applied regularization methods to the problem
of edge detection, obtaining an explicit form for the optimal filter
(Poggio et al., 1984). We will briefly review this and other work in
the next section. Other problems such as stereo depth determination
and structure from motion can also be approached in terms of
regularization analysis. At present we are working on several of these
problems.

In  summary,  the  concept  of  ill-posed  problems  and  the  associated 
regularization  theories  provide  a  powerful  theoretical  framework  for 
solving  many  of  the  problems  of  early  vision.  This  new  perspective 
justifies  the  use  of  specific  variational  principles  for  solving  certain 
problems  and  suggests  how  to  approach  many  other  early  vision 
problems.  Most  importantly,  it  provides  a  link  between  the  ill-posed 
nature of early vision problems and the computational structure of the
algorithms for solving them. We are exploiting this link by designing
fine-grained hardware to efficiently implement these algorithms.

So far we have found two powerful classes of algorithms for solving
variational problems of the type indicated by equation (1). They
consist of filtering operations and of multigrid methods. Multigrid
methods are a general and efficient way of solving quadratic
variational problems of the type of equation (1). When the data are
dense and given on a regular grid, a simpler method can be used,
for appropriate boundary conditions: the solution can be computed
by convolving the data y with a suitable filter (see Poggio and
Torre, 1984). Thus, the structure of these algorithms leads directly
to parallel, fine-grained hardware with local interconnections of the
sort used in the Connection Machine. We are presently beginning to
explore how to implement regularization methods efficiently in the
Connection Machine architecture.


2.  Edge  Detection 


We have recently applied regularization techniques to another classical
problem of early vision: edge detection. Edge detection, intended
as the process that attempts to detect and localize significant changes
of intensity in the image, can be regarded as a problem of numerical
differentiation (Torre and Poggio, 1984). Notice that differentiation
is a common operation in early vision and is not restricted to edge
detection. The problem is ill-posed because the solution does not
depend continuously on the data.

The intuitive reason for the ill-posed nature of the problem can be
seen by considering a function $f(x)$ perturbed by a very small "noise"
term $\epsilon \sin \Omega x$. $f(x)$ and $f(x) + \epsilon \sin \Omega x$ can be arbitrarily close for
very small $\epsilon$, but their derivatives may be very different if $\Omega$ is large
enough. This simply means that a derivative operation "amplifies"
high-frequency noise.
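
The argument can be checked numerically. The sketch below compares a function with a perturbed copy whose added term has a tiny amplitude and a high frequency, using analytic derivatives; the choice of sin as the test function and the particular amplitude and frequency are illustrative assumptions.

```python
import math

def max_abs_gap(f, g, xs):
    """Largest pointwise difference between f and g over the sample points."""
    return max(abs(f(x) - g(x)) for x in xs)

EPS, OMEGA = 1e-3, 1e4   # illustrative values: tiny amplitude, high frequency

def f(x):
    return math.sin(x)

def f_noisy(x):
    return math.sin(x) + EPS * math.sin(OMEGA * x)

def df(x):
    return math.cos(x)

def df_noisy(x):
    # Derivative of the noise term carries the factor EPS * OMEGA.
    return math.cos(x) + EPS * OMEGA * math.cos(OMEGA * x)

xs = [i * 1e-4 for i in range(10001)]             # samples on [0, 1]
signal_gap = max_abs_gap(f, f_noisy, xs)          # bounded by EPS
derivative_gap = max_abs_gap(df, df_noisy, xs)    # reaches about EPS * OMEGA
```

The two functions never differ by more than the noise amplitude, yet their derivatives differ by roughly the amplitude times the frequency, which here is four orders of magnitude larger.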

In 1-D, numerical differentiation can be regularized in the following
way. Let the image model be $y_i = f(x_i) + \epsilon_i$, where $y_i$ is the data
and $\epsilon_i$ represent errors in the measurements. We want to estimate
$f'$. We chose a regularizing functional $\|Pf\| = \int (f''(x))^2 \, dx$, where
$f''$ is the second derivative of $f$. One regularization method would
be to find, among the functions that interpolate the data, a function
$f$ that minimizes the functional $\|Pf\|$. This
method assumes no noise in the data, and is equivalent then to using
interpolating cubic splines for differentiation. Another regularizing
method, which is more natural since it takes into account errors in
the measurements, leads to the variational problem of minimizing
(see Reinsch, 1967)

$$\sum_i \left( y_i - f(x_i) \right)^2 + \lambda \int \left( f''(x) \right)^2 dx$$


Poggio, Voorhees and Yuille (1984) have shown (a) that the solution
of this problem $f$ can be obtained by convolving the data $y_i$ (assumed
on a regular grid and satisfying appropriate boundary conditions)
with a convolution filter $R$, and (b) that the filter $R$ is a cubic spline
with a shape very close to a Gaussian and a size controlled by the
regularization parameter $\lambda$. Differentiation can then be accomplished
by convolution of the data with the appropriate derivative of this
filter. The optimal value of $\lambda$ can be determined, for instance, by
cross validation and other techniques. This corresponds to finding
the optimal scale of the filter (see Poggio and Torre, 1984).
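
Differentiation by convolution with the derivative of a smoothing filter can be sketched as follows. Since the text notes that the optimal cubic-spline filter is very close to a Gaussian, the sketch substitutes a sampled Gaussian derivative; this substitution, the 4-sigma truncation, and the function names are assumptions made for illustration, and the filter scale is fixed by hand rather than chosen by cross validation.

```python
import math

def gaussian_derivative_kernel(sigma, step):
    """Samples of d/dx of a unit-area Gaussian, spaced `step` apart."""
    radius = int(math.ceil(4 * sigma / step))
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    kernel = []
    for k in range(-radius, radius + 1):
        x = k * step
        # g'(x) = -(x / sigma^2) g(x); the factor `step` turns the sum into an integral.
        kernel.append(-x / sigma ** 2 * norm * math.exp(-x * x / (2 * sigma ** 2)) * step)
    return kernel

def smoothed_derivative(samples, sigma, step):
    """Estimate f' at interior samples by convolving with the kernel above.

    Returns the estimates and the number of samples dropped at each end.
    """
    kernel = gaussian_derivative_kernel(sigma, step)
    radius = len(kernel) // 2
    out = []
    for i in range(radius, len(samples) - radius):
        # Convolution: sum over k of kernel[k] * samples[i - k].
        out.append(sum(kernel[radius + k] * samples[i - k]
                       for k in range(-radius, radius + 1)))
    return out, radius

# Noisy samples of sin(x): small-amplitude, high-frequency perturbation.
STEP = 0.001
xs = [i * STEP for i in range(2001)]
noisy = [math.sin(x) + 0.01 * math.sin(1000.0 * x) for x in xs]
estimate, dropped = smoothed_derivative(noisy, sigma=0.05, step=STEP)
```

On this example the filtered estimate stays close to cos(x), while a plain central difference on the same samples is dominated by the amplified noise.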

These results can be directly extended to two dimensions to cover
both edge detection and surface interpolation and approximation.
The resulting filters are very similar to two of the Gaussian-based
edge detection filters derived and extensively used in recent years
(Marr and Hildreth, 1980; Canny, 1983; see Torre and Poggio, 1984).
The present derivation is, however, more general and rigorous: the
filter follows naturally from regularizing the ill-posed problem of
numerical differentiation for regularly spaced image data.

In the area of edge detection, the problem of which differential
operators should be used after the filtering operation has been analyzed
theoretically and with computer experiments by Torre, Poggio
and co-workers (1984). In particular, they have derived relationships
among several 2-D differential operators and characterized the relation
between the Laplacian and the second directional derivative along the
gradient. In addition they have studied the properties of the critical
points of the differential operators and characterized the geometrical
and topological properties of the zero-crossings (and level-crossings)
of differential operators in terms of transversality and Morse theory.

3.  Multiple  Scale  Analysis 


As we have seen, differential operations on sampled images require
the image to be first smoothed by filtering. The filtering operation
introduces an arbitrary parameter, the scale of the filter (e.g., the
standard deviation for the Gaussian filter), which is strictly connected,
as we saw in the previous section, with the regularization parameter
$\lambda$. In computer vision, the necessity of considering several scales
of filtering was realized quite early on. This was supported by
evidence suggesting the presence of filters of several sizes in the
human visual system (Rosenfeld, 1982). More recently, Witkin (1983;
see also Stansfield, 1980) introduced a scale-space description of
zero-crossings which gives the position of the zero-crossing across a
continuum of scales, i.e., sizes of the Gaussian filter (parametrized by
the $\sigma$ of the Gaussian). The signal, or the result of applying a linear
(differential) operator to the signal, is convolved with a Gaussian
filter over a continuum of sizes of the filter. Zero- or level-crossings
of the filtered signal are contours on the $x$-$\sigma$ plane and surfaces
in the $x, y, \sigma$ space. Witkin proposed that this concise map can be
effectively used to obtain a rich and qualitative description of the
signal. Yuille and Poggio (1983a, 1983b) have established interesting
relationships between multiresolution analysis, the Gaussian filter and
zeros of differential operators. Their main results are two theorems:

(a) Zero- and level-crossings of an image filtered through a linear
differential operator of the Gaussian filter have nice scaling properties,
i.e., a simple behavior of zero-crossings across scales, with several
attractive properties for further processing. Zero-crossings are not
created as the scale increases. The nice scaling behavior is a
characteristic property of the Gaussian filter and only of the Gaussian
filter (see also Babaud, Witkin and Duda, 1983).

(b) The map of the zero-crossings across scales determines the signal
uniquely for almost all signals in the absence of noise. The scale
maps obtained by Gaussian filters are thus a complete representation of
the image. This result applies to level-crossings of any arbitrary linear
differential operator of the Gaussian, since it applies to functions that
obey the diffusion equation (the zero-crossings are then a unique
characterization modulo the null space of the differential operators
and provided there are at least two zero-crossing contours).
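
The scaling behavior in theorem (a) can be observed on a small synthetic example. The sketch below smooths a periodic 1-D signal with sampled Gaussians of increasing size and counts the zero-crossings of its second difference at each scale. The test signal, the circular boundary handling, the 4-sigma truncation, and the function names are assumptions chosen for illustration; a sampled, truncated Gaussian only approximates the continuous filter to which the theorem applies.

```python
import math

def gaussian_kernel(sigma):
    """Sampled Gaussian truncated at 4 sigma and normalized to unit sum."""
    radius = int(math.ceil(4 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights], radius

def smooth_periodic(signal, sigma):
    """Circular convolution of a periodic signal with a sampled Gaussian."""
    kernel, radius = gaussian_kernel(sigma)
    n = len(signal)
    return [sum(kernel[radius + k] * signal[(i - k) % n]
                for k in range(-radius, radius + 1))
            for i in range(n)]

def second_difference(signal):
    """Discrete second derivative with periodic boundary conditions."""
    n = len(signal)
    return [signal[(i - 1) % n] - 2.0 * signal[i] + signal[(i + 1) % n]
            for i in range(n)]

def count_zero_crossings(signal):
    """Number of sign changes between neighboring samples, around the circle."""
    n = len(signal)
    return sum(1 for i in range(n) if signal[i] * signal[(i + 1) % n] < 0)

def fingerprint_counts(signal, sigmas):
    """Zero-crossing count of the smoothed second difference at each scale."""
    return [count_zero_crossings(second_difference(smooth_periodic(signal, s)))
            for s in sigmas]

# A coarse component (2 cycles) plus a fine component (20 cycles).
N = 256
signal = [math.sin(2 * math.pi * 2 * i / N + 0.3)
          + 0.5 * math.sin(2 * math.pi * 20 * i / N + 0.7)
          for i in range(N)]
counts = fingerprint_counts(signal, [1, 2, 4, 8, 16])
```

At fine scales the count is set by the 20-cycle component; as the scale grows that component is suppressed and only the crossings of the 2-cycle component remain, so the count falls and never rises.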

The first result sheds some light on the properties of zero-crossings
and level-crossings at different scales, assuming the Gaussian filter.
It also supports the use of the Gaussian filter in a multiresolution
edge detection scheme. The second result implies that scale-space
fingerprints are complete primitives that capture the whole information
in the signal and characterize it uniquely. Reconstruction of the signal
is, of course, not the goal of early signal processing. Symbolic
primitives must be extracted from the signals and used for later
processing. Subsequent processes can therefore work on this more
compact representation instead of the original signal (see Asada and
Brady, 1984).

The second theorem has theoretical interest in that it answers the
question of what information is conveyed by the edges identified with
zero- and level-crossings of multiscale Gaussian filtered signals. It
is furthermore interesting that this complete representation happens
to coincide with the basic scheme for edge detection discussed
earlier. From this point of view it can be argued that the fingerprint
representation makes explicit exactly the information that is needed
on physical grounds, i.e., it makes explicit edges in the image. We
are now attempting (H. Voorhees, T. Poggio and A. Yuille) to attack
again the problem posed by the primal sketch, that of labeling changes
in intensity in terms of the underlying surface properties, exploiting
the use of fingerprints. The idea is to identify a small number of
primitive image intensity features, such as step edges and roof
edges, and partially label them in terms of the properties of the
underlying physical surfaces, distinguishing for instance shadows from
occlusion boundaries. The initial success of a similar attempt in the
realm of 2-D shape representation by Asada and Brady (1984), which
we will describe later, is encouraging.

It may be asked at this point what the correct sequence is for the two
steps of differentiation and filtering. For linear operators the order
is of course immaterial, since they commute. This is not the case
for nonlinear operators, such as the directional derivative along the
gradient. The regularization argument for the filtering step implies
that filtering at one scale must precede the differentiation operation.
The computation of different scales requires filtering at a range of
resolutions after differentiation. The reason is that the theorems of
Yuille and Poggio (1983a, 1983b) hold true even for the identity
operator, but are not necessarily valid if filtering is performed before
a nonlinear differential operation. In particular, Gaussian scaling after
the nonlinear directional derivative along the gradient does not have
a nice scaling behavior. Thus filtering as a regularizing operator must
be performed first at one scale and filtering at different scales must
be performed after the differential operation. For linear differential
operators, this is equivalent to a multiscale filtering either before,
after, or together with the differential operation (e.g. the Laplacian
of the Gaussian).
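
The commutation argument can be checked numerically. The following is
a minimal sketch, not taken from the work reported here: it assumes a
one-dimensional circular signal, a 3-tap binomial filter as a crude
stand-in for the Gaussian, a central difference as the linear
differential operator, and the squared difference as an illustrative
nonlinear operator (it is not the directional derivative along the
gradient). All function names are hypothetical.

```python
def smooth(x):
    """Circular 3-tap binomial filter [1/4, 1/2, 1/4], a crude Gaussian."""
    n = len(x)
    return [0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[(i + 1) % n]
            for i in range(n)]


def diff(x):
    """Circular central difference, a linear differential operator."""
    n = len(x)
    return [0.5 * (x[(i + 1) % n] - x[i - 1]) for i in range(n)]


def squared_diff(x):
    """Nonlinear stand-in operator: the square of the central difference."""
    return [d * d for d in diff(x)]


def max_gap(a, b):
    """Largest pointwise disagreement between two signals."""
    return max(abs(p - q) for p, q in zip(a, b))


signal = [0.0, 1.0, 4.0, 2.0, -1.0, 3.0, 0.5, -2.0]

# Linear case: filtering and differentiation commute.
linear_gap = max_gap(smooth(diff(signal)), diff(smooth(signal)))

# Nonlinear case: the order of the two steps changes the result.
nonlinear_gap = max_gap(smooth(squared_diff(signal)),
                        squared_diff(smooth(signal)))
```

In this sketch `linear_gap` vanishes while `nonlinear_gap` does not,
which is the reason the order of the two steps has to be fixed for
nonlinear operators.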

4.  Computation  of  Motion 

In the area of visual motion analysis, we have developed a zero-
crossing based method for computing the projected two-dimensional
velocity field from a changing image. The method allows arbitrary
three-dimensional surfaces undergoing general motion in space. The
measurement of motion is, in general, an ill-posed problem, because
the solution is not unique: there are infinitely many two-dimensional
velocity fields consistent with a given dynamic image. This problem
can be approached using the standard techniques of regularization
analysis mentioned earlier (see also Poggio and Torre, 1984). Hildreth
(1984) has developed a velocity field algorithm consisting of two main
steps: (1) initial motion measurements are made at the locations of
zero-crossings, and provide the component of velocity in the direction
perpendicular to the local orientation of the zero-crossing contours
(we denote this component by the function $v^{\perp}(s)$, where $s$ is
a curve parameter); and (2) a velocity field is computed that is
consistent with $v^{\perp}(s)$ and which exhibits the least amount of
variation along the zero-crossing contours. In particular, the velocity
field $\mathbf{V}(s)$ is computed that minimizes the measure of
variation given by the integral

$$\int \left| \frac{\partial \mathbf{V}}{\partial s} \right|^2 ds$$

along the contour (this is an instance of the second regularization
method described in Poggio and Torre, 1984). Except in the case of an
infinite straight line, there exists a unique velocity field that
minimizes this measure.

This velocity field algorithm has been implemented and tested on
both synthetic and natural images. The synthetic images consisted
of two types: (1) ideal contours undergoing a known motion, in
which the measurements of $v^{\perp}(s)$ were computed analytically,
and (2) natural images undergoing an artificial movement. It can be
shown analytically (Yuille, 1983) that the computed velocity field of
least variation will be equivalent to the true projected velocity field
when the following relationship is satisfied along the contour:

$$\mathbf{T} \cdot \frac{\partial^2 \mathbf{V}}{\partial s^2} = 0$$

where $\mathbf{T}$ is the local tangent vector along the contour. If we
assume orthographic projection of the scene onto the image plane, there
are at least two general classes of motion for which the above
relationship is satisfied: (1) pure translation of arbitrary objects
through space, and (2) rigid rotation and translation of
three-dimensional objects whose edges are straight lines. Empirical
analysis of these classes of motion verifies the correctness of the
velocity field derived from this algorithm (Hildreth 1984).

For the case of natural motion sequences, it was found that there can
be considerable error in the measurement of the perpendicular velocity
components, $v^{\perp}(s)$. This led to a reformulation of the algorithm
in such a way that the computed velocity field only approximately
satisfies $v^{\perp}(s)$. In particular, the algorithm minimizes the
following expression:

$$\int \left| \frac{\partial \mathbf{V}}{\partial s} \right|^2 ds + \beta \int \left[ \mathbf{V} \cdot \mathbf{u}^{\perp}(s) - v^{\perp}(s) \right]^2 ds$$

where $\mathbf{u}^{\perp}(s)$ is the unit vector in the direction
perpendicular to the contour and $\beta$ weights the two terms. The
first term describes the variation in the velocity field, and the
second describes how well the computed velocity field satisfies the
image measurements given by $v^{\perp}(s)$. This formulation of the
motion measurement computation is precisely of the type required
by standard regularization methods, as shown earlier in equation (1).
Empirical studies have shown that velocity field algorithms derived
from this formulation are far more robust in the presence of error in
the initial motion measurements derived from the changing image.
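
A discrete version of this minimization can be sketched as follows.
This is only an illustration under stated assumptions, not Hildreth's
implementation: the contour is closed and sampled at equally spaced
points, the variation term is replaced by squared differences between
neighboring samples, and the minimum is found by block Gauss-Seidel
relaxation. The function name, the weight `beta`, and the number of
sweeps are hypothetical choices.

```python
import math


def smoothest_velocity_field(normals, v_perp, beta=1.0, sweeps=3000):
    """Minimize  sum |V[i+1] - V[i]|^2 + beta * sum (V[i].n[i] - v_perp[i])^2
    over a closed contour by block Gauss-Seidel relaxation."""
    n = len(normals)
    field = [(0.0, 0.0)] * n
    shrink = beta / (2.0 + beta)
    for _ in range(sweeps):
        for i in range(n):
            nx, ny = normals[i]
            px, py = field[i - 1]          # previous neighbor (wraps around)
            qx, qy = field[(i + 1) % n]    # next neighbor (wraps around)
            # Right-hand side of (2 I + beta n n^T) V = rhs for sample i.
            rx = px + qx + beta * v_perp[i] * nx
            ry = py + qy + beta * v_perp[i] * ny
            dot = nx * rx + ny * ry
            # Closed-form inverse of (2 I + beta n n^T) for a unit normal n.
            field[i] = (0.5 * (rx - shrink * dot * nx),
                        0.5 * (ry - shrink * dot * ny))
    return field


# Demo: a circle in pure translation, one of the cases in which the
# smoothest field coincides with the true projected velocity field.
true_velocity = (1.0, 0.5)
count = 24
normals = [(math.cos(2.0 * math.pi * k / count),
            math.sin(2.0 * math.pi * k / count)) for k in range(count)]
v_perp = [true_velocity[0] * nx + true_velocity[1] * ny for nx, ny in normals]

field = smoothest_velocity_field(normals, v_perp)
worst_error = max(math.hypot(vx - true_velocity[0], vy - true_velocity[1])
                  for vx, vy in field)
```

Only the perpendicular components are supplied, yet the relaxation
recovers the full two-dimensional velocity at every sample because the
contour is not a straight line.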

For the case of aerial photographs, the entire scene can be treated
essentially as a single surface, undergoing a single motion. The relative
movement of objects on the ground is very small compared to their
overall displacement with respect to the airplane. In general, however,
a scene will contain multiple objects undergoing different motions, with
sharp discontinuities in the velocity field along object boundaries. The
detection of these discontinuities is especially important for algorithms
such as the one described here, which compute a smoothly varying
pattern of movement. We have begun to explore possible algorithms
to detect motion discontinuities which search for sudden changes in
the sign or magnitude of the initial perpendicular components of
velocity. These algorithms provide an indication of the location of the
discontinuities prior to the combination of the measurements of
$v^{\perp}(s)$ to compute the full two-dimensional velocity field. The
incorporation of a discontinuity detection algorithm with the subsequent
velocity field algorithm will lead to a more robust measurement of motion
for scenes with multiple objects undergoing different movements.
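
The kind of test described above can be sketched in a few lines. This
is an assumed, simplified criterion rather than the algorithm under
development: the perpendicular components are taken as a list of
samples along an open contour, and a change counts as "sudden" when it
exceeds a hypothetical threshold `jump`.

```python
def motion_discontinuities(v_perp, jump=0.5):
    """Return indices i where the perpendicular velocity component changes
    suddenly in sign or magnitude between samples i and i + 1."""
    flagged = []
    for i in range(len(v_perp) - 1):
        a, b = v_perp[i], v_perp[i + 1]
        sudden_sign_change = a * b < 0.0 and abs(b - a) > jump
        sudden_magnitude_change = abs(abs(b) - abs(a)) > jump
        if sudden_sign_change or sudden_magnitude_change:
            flagged.append(i)
    return flagged


samples = [1.0, 1.1, 0.9, 1.0, -0.8, -0.9, -1.0, -2.0, -2.1]
boundaries = motion_discontinuities(samples)  # [3, 6]
```

The flagged indices could then be used to split the contour so that the
smoothness term is not applied across an object boundary.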


5.  Stereo  Algorithms 


We are presently developing a regularization solution to stereo
matching, about which we will report soon (Yuille and Poggio). Here
we will describe two other very recent approaches to the problem of
stereo matching. The first algorithm, which is a considerable evolution
of the original stereo theory of Marr and Poggio (1979), has been
developed by Nishihara (1984) with the goal of achieving a high
speed, noise tolerant stereo matcher. This matching scheme, based
on patchwise correlation (between filtered images), can be shown to
be a special case of a more general variational principle that can be
derived with standard regularization methods. The second approach
by Yuille and Poggio (1984) leads to a generalization of the ordering
constraint (Baker, 1982) that captures several powerful constraints for
solving the correspondence problem of stereo.

5.1.  A  Fast,  Noise-Tolerant  Stereo  System 

Nishihara  has  developed  an  approach  to  solving  the  binocular-stereo- 
matching  problem  which  places  special  emphasis  on  the  practical 
issues  of  noise  tolerance,  reliability,  and  speed.  It  is  strongly  influenced 
by  Marr  and  Poggio's  zero-crossing  theory,  but  differs  from  recent 
implementations  in  the  way  zero-crossing  information  is  used  to  drive 
the matching and in the product the matcher is designed to produce.

Four design objectives have guided Nishihara's study. The first is noise
tolerance. We want to understand how matching can be accomplished
in the presence of moderate to large noise levels which occur
anytime surface contrast is low compared with sensor and inter-image
distortions. The second objective is to achieve competent performance
for at least one of the three kinds of stereo measurements: volume
occupancy, range, and location of elevation discontinuities (Nishihara
and Poggio, 1984). The third objective has been to operate at a
practical speed using existing computer technologies. The emphasis in
this work has been to streamline the computation to increase speed
and use processing resources efficiently. This has forced a careful
consideration of the relative cost of producing a measurement in
different ways vis-a-vis its contribution to the final product of the
algorithm.

With these design considerations as a base, Nishihara has designed
a binocular stereo-matching algorithm for making rapid visual range
measurements in noisy images. This technique is developed for
application to problems in robotics where noise tolerance, reliability,
and speed are predominant issues. A high speed pipelined convolver
for preprocessing images and an unstructured light technique for
improving signal quality are introduced to help enhance performance
to meet the demands of this task domain. These optimizations,
however, are not sufficient. A closer examination of the problems
encountered suggests that broader interpretations of both the objective
of binocular stereo and of the zero-crossing theory of Marr and
Poggio are required. This research has been restricted to the problem
of making a single primitive surface measurement: for example, to
determine whether or not a specified volume of space is occupied,
to measure the range to a surface at an indicated image location,
or to determine the elevation gradient at that position. In this
framework a subtle but important shift is made from the explicit
use of zero-crossing contours (in band pass filtered images) as the
elements matched between left and right images, to the use of the
signs of the convolution between zero-crossings. With this change, a
simpler algorithm is obtained with a reduced sensitivity to noise and
a more predictable behavior. The PRISM system incorporates this
algorithm with the unstructured light technique and a high speed
digital convolver. It has been used successfully by both R. Brooks
and K. Ikeuchi as a sensor in path planning and bin picking systems
respectively.

The PRISM system incorporates an efficient module for determining
the two-dimensional displacement between patches out of the
left and right images. This near/far module does not do point by
point search. Instead, a single correlation measurement is made at
a test disparity (provided as input) and a determination is made
as to whether the correlation peak can be near by (within half the
excitatory diameter of the DOG operator). If there is a positive result,
several additional correlation measurements are made at neighboring
disparities to determine the shape of the correlation function over the
test disparity. From this an estimate is made for the disparity at which
the correlation peak occurs. The name of the module comes from
the work of G. Poggio and Fischer who described a class of neurons
in primate visual cortex sensitive to either near or far disparities.
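
The idea of correlating the signs of band-pass filtered signals and then
locating the correlation peak from a few neighboring measurements can
be illustrated in one dimension. The sketch below is not the PRISM
implementation: the random texture, the filter widths, the search range,
and the parabolic peak fit are all assumptions made for the example, and
every function name is hypothetical.

```python
import math
import random


def dog_filter(signal, sigma=2.0, ratio=1.6):
    """Difference-of-Gaussians band-pass filter with zero-padded borders."""
    def gaussian_kernel(s):
        radius = int(3 * s) + 1
        weights = [math.exp(-0.5 * (k / s) ** 2)
                   for k in range(-radius, radius + 1)]
        total = sum(weights)
        return [w / total for w in weights], radius

    def convolve(x, kernel, radius):
        n = len(x)
        return [sum(kernel[k + radius] * x[i + k]
                    for k in range(-radius, radius + 1) if 0 <= i + k < n)
                for i in range(n)]

    narrow = convolve(signal, *gaussian_kernel(sigma))
    wide = convolve(signal, *gaussian_kernel(sigma * ratio))
    return [a - b for a, b in zip(narrow, wide)]


def sign_correlation(left, right, disparity):
    """Mean product of the signs of left[i] and right[i + disparity]."""
    total, count = 0, 0
    for i in range(len(left)):
        j = i + disparity
        if 0 <= j < len(right):
            total += (1 if left[i] >= 0 else -1) * (1 if right[j] >= 0 else -1)
            count += 1
    return total / count


def refine_peak(c_minus, c_center, c_plus):
    """Offset of the vertex of the parabola through three equally spaced
    correlation samples, relative to the center sample."""
    curvature = c_minus - 2.0 * c_center + c_plus
    return 0.0 if curvature == 0 else 0.5 * (c_minus - c_plus) / curvature


# Demo: the right signal is the left signal displaced by 5 samples.
random.seed(0)
texture = dog_filter([random.uniform(-1.0, 1.0) for _ in range(400)])
true_disparity = 5
left = texture[20:320]
right = texture[20 - true_disparity:320 - true_disparity]

scores = {d: sign_correlation(left, right, d) for d in range(0, 11)}
best = max(scores, key=scores.get)
estimate = best + refine_peak(scores[best - 1], scores[best], scores[best + 1])
```

A module of the near/far kind would evaluate only a few of these
disparities around the supplied test disparity rather than the whole
range scanned here.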

The principal surface parameter is distance from the cameras which
manifests itself as a translational disparity between corresponding
patches from the two images. We are also able to correlate against
parameters other than translational disparity. For example, an elevation
gradient on the physical surface viewed can be measured by correlations
against the compressive and shear distortions. These distortions are
introduced between the left and right images by horizontal and
vertical elevation gradients.

5.2.  A  Generalized  Ordering  Constraint  for  Stereo 

The problem of stereo matching is ill-posed and underdetermined:
constraints are needed to make the solution unique, and to reduce
the search problem among possible matches.

Marr and Poggio (1979) originally identified two constraints: (1)
uniqueness, that is, an element in one image in general only
corresponds with a single element in the other image, and (2)
continuity, that is, stereo disparity varies smoothly almost everywhere
in the image. These constraints are powerful because they do not
depend on the specific properties of the scene but on general
properties of the stereo geometry. Marr and Poggio (1979) proposed
a stereo matching algorithm, further developed by Grimson (1981,
1984), which incorporates the uniqueness and continuity constraints
to match zero-crossing descriptions computed at different scales.

An ordering constraint along epipolar lines has been exploited, both
implicitly and explicitly, in several computer algorithms for stereo
matching (for example, Baker and Binford, 1981) as a special instance
of the continuity constraint. Epipolar lines in the two images are lines
on which corresponding points lie. The projections of a point P in
space lie on the plane defined by P and the two camera foci and,
as a consequence, on the two lines defined by the intersection of
this plane with the two image planes. This implies that the matching
problem can be reduced to a one-dimensional search if the epipolars
are known. Most algorithms assume that the epipolar geometry is
known (from a known camera geometry) and that the images are
registered. Furthermore, the ordering of edges or other features is
usually preserved by stereo projection along epipolar lines (that is,
if feature A is to the left of feature B in the left stereo image, then
this spatial relationship is maintained in the right stereo image). The
ordering constraint along epipolar lines follows from the continuity
of surfaces and the assumption of opacity. As originally suggested
by Baker (1982) the ordering constraint is violated in some situations
(in the "forbidden zone"). The forbidden zone associated with each
point of the visible surface is a set of points in space that would
have images violating the ordering constraint. If any point in the
forbidden zone would be connected to the first point by an opaque
surface the two images would "see" opposite sides of the surface.
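
The classical ordering constraint is easy to state in code. The sketch
below assumes registered images and represents each match on one pair
of epipolar lines as a hypothetical tuple `(x_left, x_right)` of feature
positions; it is an illustration of the constraint, not one of the cited
matching algorithms.

```python
def ordering_preserved(matches):
    """True if features keep their left-to-right order in both images.

    matches: list of (x_left, x_right) positions on one epipolar line pair.
    """
    ordered = sorted(matches)  # sort by position in the left image
    return all(a[1] <= b[1] for a, b in zip(ordered, ordered[1:]))


def consistent_with(candidate, accepted):
    """True if a candidate match does not cross any accepted match."""
    return all((candidate[0] - xl) * (candidate[1] - xr) >= 0
               for xl, xr in accepted)


accepted = [(10, 7), (20, 16), (30, 25)]
keeps_order = consistent_with((15, 12), accepted)      # lies between neighbors
crosses_match = consistent_with((15, 18), accepted)    # crosses (20, 16)
```

Rejecting candidates for which `consistent_with` is false is one simple
way of using the constraint to prune the search among possible matches.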

This ordering constraint can be exploited to reduce the complexity
of the search for matching features, and to eliminate false matches.
There are, however, situations in which the images are not precisely
registered. Furthermore, physical edges are inherently two-dimensional
(a property that is not exploited by the epipolar ordering constraint).
It is therefore natural to ask whether the ordering constraint can be
generalized away from epipolar lines. Yuille and Poggio (1984) have
shown that it is indeed possible to generalize the ordering constraint.
The analysis of Yuille and Poggio considers a simple stereo geometry:
it begins by proving a simple relationship between the two images
of a 3D curve that leads to a generalization of the classical ordering
constraint. This relationship allows one to identify special points in
the images that correspond uniquely to the same physical point in the
object curve. The Generalized Ordering Constraint (GOC) implies
several of the specific constraints listed by Baker et al. (1983) (see the
crossproduct constraint), Mayhew and Frisby (1981) (see their figural
continuity constraint), and Ohta and Kanade (1983). The ordering
constraint breaks down when the object curve enters the forbidden
zone. From this analysis, Yuille and Poggio outline an algorithm
based on matching contours. From a single contour the algorithm
retrieves the viewing parameters and unambiguously matches points
along the contour using the generalized ordering constraint. Tiffany
is now developing an algorithm based on this constraint.

6. Multigrid Algorithms for Regularization Analysis


6.1.  Background  and  test  report  of  surface  reconstruction 
algorithm 

We have investigated the computation of visible-surface
representations, and related problems. As was described in previous
reports, we have considered the efficient reconstruction of visual
surfaces from scattered depth constraints using multiresolution
iterative processing. The theoretical basis of the schemes has recently
been tested on several sets of data: on data which was generated
synthetically, on natural image data preprocessed by a photometric
stereo method and two stereopsis algorithms, as well as data from
Brou's laser rangefinder system (Terzopoulos, 1984a). The results of
this extensive testing, coupled with the fact that the surface
reconstruction code has been employed in other research projects in the
laboratory, attest to the computational efficiency and robustness of the
surface reconstruction algorithm.

6.2.  New  multiresolution  algorithms 

As was argued by several authors (Grimson, 1981; Brady and Horn,
1982 and especially Terzopoulos, 1984c), many early vision problems
can be posed as variational principles. This is substantiated further
by Poggio and Torre's (1984) view of problems in early vision
as mathematically ill-posed problems which can be transformed
into well-posed variational principles by regularization techniques.
An attractive feature of these types of variational principles, once
discretized, is that their solutions can be computed by local, iterative
algorithms, which can be executed by many processors arranged
in locally-connected networks or grids. Given only local processing
capabilities, however, the essential global properties (e.g. (piecewise)
smoothness, consistency, minimal energy, etc.) of the desired solutions
must be satisfied indirectly by propagating visual information across
grids, through iteration. Substantial computational inefficiency can
result since grids can become extremely large in machine vision
applications. Multiresolution iterative processing can overcome this
inefficiency, as demonstrated by the application of multigrid methods
(Hackbusch and Trottenberg, 1982) to visual surface reconstruction.

Exploring further the idea of applying multigrid methods in early
vision, Terzopoulos (1984d) has developed and implemented efficient
multiresolution iterative algorithms for other early vision problems;
in particular, the computation of lightness, shape from shading,
and optical flow from images. These new algorithms are based
on the theoretical work of Horn and his colleagues; however,
there are certain interesting novelties aside from the fact that the
algorithms compute consistent visual representations at multiple
scales. Notably, the lightness algorithm is locally parallel and iterative
whereas Horn's involved convolution with a (global-support) Green's
function, the shape from shading algorithm works in conjunction
with the multiresolution surface reconstruction algorithm to provide a
representation of the surface in depth, and the optical flow algorithm
can handle flow discontinuities at known occluding boundaries. The
multiresolution algorithms have been tested on synthetic images, and
all three are significantly more efficient than the single level versions
(between one and two orders of magnitude).

This approach provides a general, efficient and powerful method
for solving all vision problems that can be formulated in terms
of quadratic regularization principles. We are now considering the
eventual implementation of these multiresolution algorithms on our
Connection Machine.
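
Why multiresolution processing helps a purely local relaxation can be
seen on a toy problem. The sketch below is a textbook multigrid V-cycle
for the one-dimensional equation -u'' = f with zero boundary values; it
is not the surface reconstruction code, and the grid size, the number
of sweeps, and all function names are assumptions chosen for the
example.

```python
def relax(u, f, h, sweeps):
    """Gauss-Seidel sweeps for -u'' = f with u = 0 at both ends (in place)."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    """Residual f - A u of the three-point discretization."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full-weighting transfer to a grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    coarse = [0.0] * (nc + 1)
    for j in range(1, nc):
        coarse[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return coarse


def prolong(e):
    """Linear interpolation to a grid with twice as many intervals."""
    nc = len(e) - 1
    fine = [0.0] * (2 * nc + 1)
    for j in range(nc):
        fine[2 * j] = e[j]
        fine[2 * j + 1] = 0.5 * (e[j] + e[j + 1])
    fine[2 * nc] = e[nc]
    return fine


def v_cycle(u, f, h):
    """One V-cycle: relax, correct from the coarser grid, relax again."""
    n = len(u) - 1
    if n == 2:
        u[1] = 0.5 * h * h * f[1]  # exact solve for the single unknown
        return
    relax(u, f, h, 2)
    coarse_error = [0.0] * (n // 2 + 1)
    v_cycle(coarse_error, restrict(residual(u, f, h)), 2.0 * h)
    correction = prolong(coarse_error)
    for i in range(1, n):
        u[i] += correction[i]
    relax(u, f, h, 2)


n = 64
h = 1.0 / n
f = [1.0] * (n + 1)

multi = [0.0] * (n + 1)
for _ in range(8):             # 8 cycles = 32 relaxation sweeps on the fine grid
    v_cycle(multi, f, h)

single = [0.0] * (n + 1)
relax(single, f, h, 32)        # the same number of fine-grid sweeps, one level

multi_residual = max(abs(x) for x in residual(multi, f, h))
single_residual = max(abs(x) for x in residual(single, f, h))
```

With the same number of fine-grid sweeps, the single-level relaxation
has barely propagated the boundary information into the interior, while
the V-cycles have essentially converged; the coarse grids add roughly
as much work again in one dimension.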


7.  Regularization  Theory  of  Discontinuities 

As we discussed in section 1, many early vision problems can be
characterized as ill-posed problems and treated by regularization
methods such as that represented by eq. (1). A standard class of
stabilizing functionals are so-called Tikhonov stabilizers that typically
impose smoothness conditions on the possible solutions (e.g., restrict
them to be members of Sobolev spaces). This is sufficient to adequately
solve a number of ill-posed problems in physics, and it applies in
an approximate sense in vision inasmuch as the coherence of matter
tends to produce smooth surfaces relative to the viewing distance
at certain scales (of course, regularization theory is not restricted to
Tikhonov stabilizers!).

A notable complication which arises repeatedly in early vision
problems, however, is the necessity of dealing with discontinuities.
Discontinuities are ubiquitous at all stages of visual processing, from
the representation of images to the representation of surfaces.
The stabilizing functionals used so far in early vision problems cannot
deal adequately with discontinuities, since they offer no control over
smoothness. Recent work by Terzopoulos, Marroquin and Poggio
is aimed at a formal treatment of discontinuities as an extension of
standard regularization analysis in early vision.

7.1.  Discontinuities  and  splines  under  tension 

Terzopoulos is extending his earlier analysis of the discontinuity
problem in computing visible-surface representations. He proposed
as a physical model of visual surfaces the "thin plate surface under
tension," a natural two-dimensional generalization of the well-known
spline under tension. Since surface reconstruction is essentially an
ill-posed problem (the available visual constraints do not uniquely
determine the surface, see Poggio and Torre, 1984) this class of
surfaces can be viewed as defining a stabilizing functional with
smoothness properties that can be controlled appropriately at depth
and orientation discontinuities. Details and generalizations to other
ill-posed vision problems are currently under analysis.

7.2.  Beyond  Standard  Regularization  Methods 

A related approach to the same problem is being followed by
Marroquin and Poggio (Marroquin, 1984), and is based on the work
by Geman and Geman (1984). The surface to be reconstructed is
considered as a sample from a stochastic process based on a Gibbs
distribution (that is, it is modelled as a Markov Random Field).
The reconstruction problem is then equivalent to that of finding
the best (Bayesian) estimate for the surface, given the information
provided by the observations (which themselves may be the output
of some process, such as stereo or laser ranging techniques), and
the a priori knowledge, both about the properties of the surface
(piecewise smoothness, for example), and about the geometry of the
discontinuities, if they are present. It is interesting to note that this
construction leads to a mathematical formulation which is very similar
to the ones obtained from the standard regularization techniques.
The best estimate is chosen as the one that minimizes a certain
"energy" function that consists of the sum of a term that corresponds
to the agreement of the estimate with the observations, and terms
corresponding to the constraints imposed by the prior knowledge
about the nature of the solution. This should not come as a surprise,
since Bayesian estimation is a regularization method. The functionals
obtained by this approach are, in general, non-convex, and the
computational problem of finding their global minimum is, therefore,
a complex one.

At present, we have applied this technique to the problem of
reconstructing a piecewise smooth surface from sparse and noisy
data, and we have obtained encouraging results (using synthetic
data). The algorithm simultaneously detects the discontinuities and
interpolates smooth surfaces across the appropriate regions. For dense
data, the same algorithm may be used for edge detection and image
segmentation tasks.

The problem of global minimization of the corresponding energy
functionals is currently being solved using the method of simulated
annealing, a stochastic technique recently developed by Kirkpatrick
et al. (1982), which is effective, but computationally expensive (at
least on a serial machine). Currently, research is progressing, aimed
at trying to find algorithms that are computationally more efficient,
and extending the range of applications of this approach.
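
The structure of such an energy function, and its minimization by
simulated annealing, can be sketched for a one-dimensional profile.
Everything specific below is an assumption made for illustration and is
not the formulation used in the work reported here: the quadratic data
term, the binary "line" variables that switch off smoothing between
neighbors, the weights `lam` and `alpha`, the Metropolis acceptance
rule, and the geometric cooling schedule.

```python
import math
import random


def energy(f, lines, data, lam=4.0, alpha=0.5):
    """Data fit + smoothness (disabled where lines[i] == 1) + line penalty."""
    fit = sum((fi - di) ** 2 for fi, di in zip(f, data))
    smooth = sum((1 - l) * (f[i + 1] - f[i]) ** 2 for i, l in enumerate(lines))
    return fit + lam * smooth + alpha * sum(lines)


def anneal(data, steps=20000, t_start=1.0, t_end=0.01, seed=0):
    """Metropolis annealing; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    f = list(data)
    lines = [0] * (len(data) - 1)
    current = energy(f, lines, data)
    best = (current, list(f), list(lines))
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        new_f, new_lines = list(f), list(lines)
        if rng.random() < 0.5:
            new_f[rng.randrange(len(f))] += rng.gauss(0.0, 0.1)
        else:
            k = rng.randrange(len(lines))
            new_lines[k] = 1 - new_lines[k]  # toggle one discontinuity
        proposed = energy(new_f, new_lines, data)
        accept = (proposed <= current or
                  rng.random() < math.exp((current - proposed) / temperature))
        if accept:
            f, lines, current = new_f, new_lines, proposed
            if current < best[0]:
                best = (current, list(f), list(lines))
    return best


# Synthetic noisy step profile (made-up values).
data = [0.1, -0.05, 0.0, 0.08, -0.1, 0.02, 1.0, 0.95, 1.1, 1.02, 0.9, 1.05]
initial_energy = energy(list(data), [0] * (len(data) - 1), data)
best_energy, surface, discontinuities = anneal(data)
```

Because the energy is non-convex in the joint space of surface values
and line variables, a stochastic search of this kind is one way to avoid
poor local minima, at the computational cost noted above.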

8.  Shape  Representation  and  Object  Recognition 

At a higher level we are actively developing several approaches to
shape representation and object recognition. Though the primary
interest is in the field of robotics, we expect that our research in
this area will have a significant fall-out for image understanding in
general. We will first describe the approach of Brady and coworkers
to the problem of computing, representing and exploiting 2-D and
3-D shapes. We will then outline a system developed for object
recognition in robotics by Brou. We will conclude with an algorithm
for recognizing polyhedral objects from sparse sensory data, due to
Grimson and Lozano-Perez.

8.1.  Smoothed  Local  Symmetries,  the  Curvature  Primal 
Sketch  and  Reasoning  About  Shapes 

Brady and his colleagues have investigated the representation of two
and three dimensional shape. Smoothed local symmetries (Brady and
Asada 1984; Brady 1984) represent both the bounding contour and
region subtended by a two-dimensional shape. Brady and Asada
(1984) have developed a mathematical analysis of the smoothed local
symmetry representation, and constructed an efficient implementation.
Intensive experiments have been carried out on images of tools,
leaves, and animals to determine the stability and sensitivity of
the representation. Brady and Asada's implementation performs the
following operations: first, the significant changes of curvature are
found at a variety of scales; second, the contour is approximated
between successive curvature changes by quadratics, and the spines
are computed between pairs of quadratics. The spines are displayed
as curves; but they are in fact represented internally as descriptions
of the parameters proposed by the mathematical analysis.

The multiscale representation of significant curvature changes is called
the Curvature Primal Sketch (Asada and Brady 1984). Curvature
changes are detected, localized, and assigned a symbolic description
by matching the multiscale patterns of curvature and curvature change
peaks to the idealized responses of a set of models that includes
corner, crank, and smooth join. The models are analogous to those
investigated by Marr (1976) in the original Primal Sketch.

Heide (1984) has based an alternative implementation of smoothed
local symmetries on an efficient algorithm developed by Bookstein
for the symmetric axis transform. He has developed a hierarchy of
symbolic descriptions of a shape that has the raw parameter values of
the contour and region at the lowest level, and a symbolic description
of the major spines and contours at the top level. The major spines
are found by smoothly extending spines whose descriptions are
sufficiently similar, and by subsuming spines whose covers are wholly
contained within the cover of some other spine. An intermediate
level provides a symbolic description of all the primitive spines.
Heide's representation has broad scope; it reduces to that proposed
by Hollerbach (1975) for globally symmetric shapes that have no
attached subparts.

Although there are many geometrically plausible descriptions of any
given shape, the human visual system is usually quite definite about
the one that is perceived. Indeed, ambiguity often has to be pointed
out for us to realise that it is possible. For example, a square that has
a small square removed from one of its corners is rarely perceived as
an L-shape, or vice versa. Bagley has investigated this problem for
polyhedral shapes using smoothed local symmetries and a database
of models. His program generates descriptions that accord well with
human perception. Metric information is often important for choosing
between alternative descriptions of a shape. Bagley's program is also
capable of describing overlapping shapes.

The various implementations of smoothed local symmetries result in a
semantic-network-like symbolic description of a shape. We have begun
to investigate the usefulness of this representation for perceptual goals
other than inspection and recognition. In particular, we have interfaced
smoothed local symmetries to Winston's ANALOGY program (Brady,
Agre, Braunegg, and Connell, 1984). The resulting program can be
taught to recognise simple tool shapes. It can learn to recognise that
a tool is a hammer yet learn that there is a functional hierarchy
of hammers. Our goal is to be able to reason about objects by
relating their shape to their function (see (Winston, Binford, Katz,
and Lowry, 1984)). Suppose one's goal is to drive fine tacks into
upholstery. We know (somehow) that it is a bad idea to use a typical
framing hammer for the job because it tends to break the tacks
and destroy the furniture. A specialized tool called a tack hammer
has been invented; but one might not be available. Based on a
suitable model of hammering and of tool shapes, one learns the
trick of grasping a screwdriver by its blade and using the handle to
drive the tack. Our program is almost capable of such reasoning. We
propose that higher order geometric structures based on smoothed
local symmetries directly support reasoning between structure and
function.

We have continued to develop a representation of three-dimensional
surfaces (Brady and Yuille 1984; Brady, Ponce, Yuille, and Asada
1985). The work has both a theoretical and an empirical component.
The theoretical component is a study of classes of surface curves
as a source of constraint on the surface on which they lie, and as
a basis for describing it. Brady, Ponce, Yuille, and Asada (1985)
analyze bounding contours, surface intersections, lines of curvature,
and asymptotes. They develop a novel proof of, and extension to,
a recent result due to Koenderink that shows that the sign of the
Gaussian curvature of the surface at points along the boundary curve
is the same as the sign of the curvature of the projection of the
boundary curve.

They also prove a theorem about generalized cones that relates surface
curves to a volumetric representation proposed by Marr (1977). Marr
considered generalized cones (Brooks 1981; Brooks and Binford,
1980) with straight axes. He suggested that such a generalized cone is
effectively represented by (i) those cross-sections, called skeletons, for
which the expansion function attains an extreme value; and (ii) the
tracings, called flutings, for which the cross-section function attains
an extremum. (A tracing is the space curve formed by a point of the
cross-section contour as the cross-section is drawn along the axis.)
Brady, Ponce, Yuille, and Asada (1985) show that if the axis of a
generalized cone is planar, and the eccentricity of the cone is zero,
then (i) a cross section is a line of curvature if and only if the
cross section is a skeleton; and (ii) a tracing is a line of curvature if
the generalized cone is a tube surface (the expansion function is a
constant), or the tracing is a fluting.

The  experimental  work  investigates  whether  the  information  suggested 
by  the  theoretical  analysis  can  be  computed  reliably  and  efficiently. 
Brady, Ponce, Yuille, and Asada (1985) demonstrate algorithms
that compute lines of curvature of a (Gaussian smoothed) surface;
determine planar patches and umbilic regions; extract axes of surfaces
of revolution and tube surfaces. They report preliminary results on
adapting the curvature primal sketch algorithms of Asada and Brady
(1984) to detect and describe surface intersections.

8.2.  Object  Representation  with  EGI 

We have used the Extended Gaussian Image (EGI) to represent
objects and obtain their orientation in depth maps (Brou, 1983). This
approach makes it possible to separate the translational and rotational
components of the object position. The EGI is formed by mapping
surface information onto a unit sphere. The value assigned to each
of the sectors n on the sphere is equal to the surface area of the
object with normal n. Minkowski's theorem guarantees that
this representation is unique for each convex object. Unfortunately
ambiguities arise when the object is non-convex but, in any given
application, it is unlikely that two non-convex objects will have the
same EGI. A CAD system based on constructive solid geometry was
built to describe objects to the machine (Brou, 1983). The program
obtains the polyhedral representation of the object and constructs the
EGI automatically.

The orientation of the object can be obtained from the depth map
by forming an EGI of the depth data and comparing it with the EGI
of the objects.

Many implementation problems had to be addressed before the
successful implementation of the comparator. The first of these
is the internal representation of the EGI. Unlike an image that
is conveniently stored in a two-dimensional array, the problem is
now one of sampling and storing data obtained on a sphere. A
geodesic representation derived from the icosahedron was then selected
to tessellate the sphere. It allows arbitrarily fine subdivisions and is
almost uniform. A tessellation with ">()(> triangular patches proved to be
optimal. The triangles are projected onto the sphere and the spherical
patches formed by them define certain ranges of orientations. An
algorithm assigns a value to each of these cells by computing the
surface area of the object pointing in the range of direction defined
by the cell.
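
The cell-filling step can be sketched as follows. This is only a minimal
illustration in Python, not the implementation described above. It assumes
that the object is given as outward-oriented triangles, that each cell is
represented just by a unit direction at its centre (a simplification of the
geodesic patches), and that a face is credited to the cell whose centre
lies closest to the face normal. The names extended_gaussian_image,
UNIT_CUBE and AXIS_CELLS are invented for the example.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def triangle_normal_and_area(p, q, r):
    """Unit normal (right-hand rule on p, q, r) and area of a triangle."""
    n = _cross(_sub(q, p), _sub(r, p))
    length = math.sqrt(_dot(n, n))
    return (n[0] / length, n[1] / length, n[2] / length), 0.5 * length

def extended_gaussian_image(triangles, cell_dirs):
    """Add each face's area to the cell whose centre best matches its normal."""
    egi = [0.0] * len(cell_dirs)
    for p, q, r in triangles:
        normal, area = triangle_normal_and_area(p, q, r)
        best = max(range(len(cell_dirs)),
                   key=lambda i: _dot(normal, cell_dirs[i]))
        egi[best] += area
    return egi

def _quad(a, b, c, d):
    """Split a planar quadrilateral (counter-clockwise from outside) in two."""
    return [(a, b, c), (a, c, d)]

# Hypothetical test object: the unit cube, two triangles per face.
UNIT_CUBE = (
    _quad((1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)) +   # +x face
    _quad((0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)) +   # -x face
    _quad((0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)) +   # +y face
    _quad((0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)) +   # -y face
    _quad((0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)) +   # +z face
    _quad((0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0))     # -z face
)

# Six coarse cells, one per axis direction (far coarser than a geodesic grid).
AXIS_CELLS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1)]

print(extended_gaussian_image(UNIT_CUBE, AXIS_CELLS))
```

With these six cells every face of the cube falls into its own cell, so
the total mass on the sphere equals the surface area of the object. A finer
tessellation changes only the list of cell directions.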

The second important problem is the comparison of the two models.
Finding the orientation of the object in the image is equivalent to
identifying the relative orientation of the distributions on the two
spheres. It is basically a correlation problem, but in the space of
rotations. It is then necessary to determine the rotations that will
be used to compare the models. This was solved by representing
rotations as quaternions and considering them as points on a unit
sphere in a four dimensional space. By studying regular figures in that
space, an algorithm was developed to uniformly sample the space of
rotations with arbitrarily large numbers of points. A final verification
was done to guarantee that the grids of the tessellated sphere line up
for each of the rotations in the set. Unfortunately this means that
multiple EGIs of each object are required (one for each set of 60
rotations).
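
A toy version of this search is sketched below in Python. It is a sketch
under stated assumptions rather than the comparator described above: the
sphere is divided into only six axis-aligned cells, the candidate set is
the 24 rotations of the cube (chosen because they map this coarse grid
onto itself), and a sum of squared differences stands in for the
correlation measure. All function names are invented for the example.

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(q, v):
    """Rotate vector v by unit quaternion q: v + w*t + u x t, with t = 2 u x v."""
    w, u = q[0], q[1:]
    t = tuple(2.0 * c for c in cross(u, v))
    c = cross(u, t)
    return tuple(v[i] + w * t[i] + c[i] for i in range(3))

def canonical(q):
    """Rounded key plus sign-fixed quaternion, so q and -q count as one rotation."""
    key = tuple(round(c, 3) + 0.0 for c in q)
    lead = next(c for c in key if c != 0)
    if lead < 0:
        q = tuple(-c for c in q)
        key = tuple(-c + 0.0 for c in key)
    return key, q

def cube_rotations():
    """Close two 90-degree generators under multiplication (24 rotations)."""
    s = math.sqrt(0.5)
    generators = [(s, 0.0, 0.0, s), (s, s, 0.0, 0.0)]  # about z, about x
    key, q = canonical((1.0, 0.0, 0.0, 0.0))
    found = {key: q}
    frontier = [q]
    while frontier:
        q = frontier.pop()
        for g in generators:
            key, p = canonical(qmul(g, q))
            if key not in found:
                found[key] = p
                frontier.append(p)
    return [found[k] for k in sorted(found)]

CELLS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def bin_egi(weighted_normals):
    """Histogram of (normal, area) pairs over the six coarse cells."""
    egi = [0.0] * len(CELLS)
    for n, area in weighted_normals:
        best = max(range(len(CELLS)),
                   key=lambda i: sum(a * b for a, b in zip(n, CELLS[i])))
        egi[best] += area
    return egi

def find_orientation(model, data_egi, rotations):
    """Return the candidate rotation whose rotated model EGI best fits the data."""
    def error(q):
        h = bin_egi([(rotate(q, n), a) for n, a in model])
        return sum((x - y) ** 2 for x, y in zip(h, data_egi))
    best = min(rotations, key=error)
    return best, error(best)

# Hypothetical model: a 1 x 2 x 3 box, given as (face normal, face area).
BOX = [((1, 0, 0), 6.0), ((-1, 0, 0), 6.0), ((0, 1, 0), 3.0),
       ((0, -1, 0), 3.0), ((0, 0, 1), 2.0), ((0, 0, -1), 2.0)]
```

Because the box is symmetric, several of the 24 rotations explain the same
data equally well; the search returns one of them. The sketch also shows
why the candidate rotations must be matched to the grid: a rotation that
does not map cells onto cells would smear area across cell borders.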

All the algorithms were tested with real and artificial data. The
experiments revealed that the most difficult factor to deal with is the
size of the space of rotations. Even when about 6000 rotations are
used to compare the two models, errors of If! degrees are possible.
This is acceptable for some pick and place operations, but in order
to obtain the high level of accuracy required for assembly, hundreds
of thousands of rotations would be required. From this analysis, it
was concluded that the EGI technique can still be used to obtain an
estimate of the object's orientation. Feature based techniques would
then be required to reduce the error range to a few degrees.

8.3.  Object  Recognition  from  Sparse  Visual  Data 

An alternate approach by Grimson and Lozano-Pérez to object
recognition is aimed at using very simple, sparse, potentially noisy
sensory measurements. It assumes that sensory information about a
scene can be processed to obtain a set of estimates of the position
and surface orientation of small patches on a surface. Because of
the simplicity of the sensory requirements, many different sensing
modalities could be used to obtain the data. In order to recognize
instances of known objects in the scene, the sparse sensory data is matched
against polyhedral models of the known objects, having up to six
degrees of freedom relative to the sensor (three translational and
three rotational). We stress that the objects need not be themselves
polyhedral, only that they can be so modeled to within some bounded
error.

The approach operates by examining hypotheses about pairings
between sensed points and object surfaces. Of course, examining all
possible assignments of sensed points to object surfaces is infeasible
for all but trivial cases. The key to the approach is identifying simple,
robust constraints that will effectively and efficiently reduce the size
of the portions of the search space that must be explicitly explored.
We have found (Grimson and Lozano-Pérez 84) that a very effective
set of coordinate-frame-independent constraints can be derived by
considering local constraints on: (1) distances between faces, (2)
angles between face normals, and (3) angles, relative to the surface
normals, of vectors between sensed points. These constraints turn out
to be very efficient in reducing the number of feasible interpretations
of the sensory data, usually to a unique interpretation, modulo partial
symmetries of the object.
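
The pruning idea can be illustrated with a small two-dimensional sketch in
Python. It is not the authors' algorithm: it applies only the first two
constraints (distances between faces and angles between face normals,
compared here through their cosines), uses a plain depth-first search
instead of table lookup, and the model, tolerances and function names are
all hypothetical.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def point_segment_distance(p, seg):
    a, b = seg
    ab = (b[0] - a[0], b[1] - a[1])
    t = dot((p[0] - a[0], p[1] - a[1]), ab) / dot(ab, ab)
    t = max(0.0, min(1.0, t))
    return dist(p, (a[0] + t * ab[0], a[1] + t * ab[1]))

def distance_range(seg_a, seg_b):
    """Smallest and largest distance between points on two model edges.
    Assumes the edges do not cross (true for a simple polygon)."""
    hi = max(dist(p, q) for p in seg_a for q in seg_b)
    if seg_a == seg_b:
        return 0.0, hi
    lo = min([point_segment_distance(p, seg_b) for p in seg_a] +
             [point_segment_distance(q, seg_a) for q in seg_b])
    return lo, hi

def consistent(model, fa, fb, sa, sb, ang_tol, dist_tol):
    """Can sensed patches sa, sb lie on model faces fa, fb respectively?"""
    (pa, na), (pb, nb) = sa, sb
    (seg_a, ma), (seg_b, mb) = model[fa], model[fb]
    if abs(dot(na, nb) - dot(ma, mb)) > ang_tol:      # angle between normals
        return False
    lo, hi = distance_range(seg_a, seg_b)             # distance between faces
    return lo - dist_tol <= dist(pa, pb) <= hi + dist_tol

def interpretations(model, data, ang_tol=0.05, dist_tol=0.05):
    """Depth-first search over face assignments, pruned pairwise."""
    results = []
    def extend(partial):
        k = len(partial)
        if k == len(data):
            results.append(tuple(partial))
            return
        for face in model:
            if all(consistent(model, partial[i], face, data[i], data[k],
                              ang_tol, dist_tol) for i in range(k)):
                extend(partial + [face])
    extend([])
    return results

# Hypothetical model: a 3-4-5 right triangle, face -> (edge, outward normal).
TRIANGLE = {
    "A": (((0.0, 0.0), (4.0, 0.0)), (0.0, -1.0)),
    "B": (((4.0, 0.0), (0.0, 3.0)), (0.6, 0.8)),
    "C": (((0.0, 3.0), (0.0, 0.0)), (-1.0, 0.0)),
}

def to_sensor_frame(point, normal, theta=0.7, shift=(5.0, -2.0)):
    """Express a model-frame patch in an arbitrary sensor frame."""
    c, s = math.cos(theta), math.sin(theta)
    rot = lambda v: (c * v[0] - s * v[1], s * v[0] + c * v[1])
    p = rot(point)
    return (p[0] + shift[0], p[1] + shift[1]), rot(normal)

SENSED = [to_sensor_frame((1.0, 0.0), (0.0, -1.0)),   # a patch on face A
          to_sensor_frame((2.0, 1.5), (0.6, 0.8)),    # a patch on face B
          to_sensor_frame((0.0, 1.0), (-1.0, 0.0))]   # a patch on face C

print(interpretations(TRIANGLE, SENSED))
```

The sensed patches are deliberately expressed in a rotated and shifted
frame: the constraints involve only relative distances and angles, so the
search never needs to know the sensor pose.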

The constraints have several advantages. They are coordinate-frame-
independent, so that the object is recognized independent of any
peculiarities of the sensor's coordinate system. In other words, the
objects are recognized by matching intrinsic shape characteristics.
Only after all feasible interpretations are found does the algorithm
explicitly solve for a legal transformation from model coordinates
to sensor coordinates, thereby localizing the object. The constraints
also demonstrate a strong degree of robustness to noise, degrading
gracefully as the sensor measurements are increasingly perturbed. We
have routinely run the algorithm successfully with errors in measuring
surface orientation on the order of 30° and with errors in measuring
surface position of 1 part in 50. The constraints also straightforwardly
extend to the case of recognizing overlapping or partially occluded
objects. As well, the constraints apply both to the simple case of
isolated objects in stable positions (three degrees of freedom) and
to the more complex case of arbitrarily oriented objects (six degrees
of freedom). Because of the simplicity of the technique, it has been
implemented in a table-lookup algorithm that is quite fast.

The algorithm has been tested on both synthetic and real data. We
have run extensive simulations on both two dimensional and three
dimensional objects, under a variety of situations, including widely
varying ranges of simulated error. The success of these simulations
in identifying objects in the presence of noise from very few data
points has been further supported by a theoretical combinatorial
analysis (Grimson 84). We have also successfully tested versions of
the algorithm on real data obtained from grey level image data and
laser ranging devices. M. Drumheller has applied it to noisy and
sparse sonar sensor data.

9.  The  Analysis  of  Spatial  Relations 

Today, there is still a notable lack of solid methods for analyzing
shapes and spatial relations in images. For example, there are few
mechanisms for dealing with questions about spatial relations such as
"does object A lie within object B," "is A supported from below,"
"can A fit in the space between B and C." The ability to process spatial
information efficiently and provide answers to questions regarding
shape properties and spatial relations is crucial for the tasks of
object recognition, visually guided manipulation, and reasoning about
scenes.

In addition to the lack of suitable algorithms, standard computer
architecture is inappropriate for this important task, and this deficiency
hinders the interpretation and use of visual information.

The problem of computing general shape properties and spatial
relations among objects and object parts is a novel domain of
investigation. Mahoney and Ullman have begun a study of the
analysis of spatial information and development of new algorithms
for the computation of spatial relations. Mahoney has chosen as a
specific domain for this project the problem of interpreting terrain
maps, since this requires the computation of many simple and often
quite difficult spatial relations from rather simple primitives, mainly
line drawings. We expect that for achieving the required level of
performance in this task the use of parallel operations would be
required. This will require in turn the use of new types of processors,
specialized for the parallel processing of spatial information.

Computation of simple spatial relations is very often global, as
is well known from the Perceptrons work of Minsky and Papert. We
discuss the problem of computing spatial relations at some length in
the following subsections, because we believe it is a critical issue for
attacking the problems of intermediate and high level vision. They
suggest a class of algorithms that are not of the regularizing type and
which require a hardware facility in a parallel architecture - such
as the Connection Machine router - to support the use of pointers
and non-local connections.

9.1.  Terrain  Maps  and  the  Study  of  Spatial  Analysis 

The  use  of  diagrams  is  remarkably  effective  in  a  wide  range  of 
human  problem-solving  situations.  Diagrams  provide  a  rich,  compact 
external  information  store,  from  which  visual  processes  rapidly  extract 
just  the  information  relevant  to  the  task  at  hand.  Moreover,  the  fact 
that  people  often  find  it  useful  to  draw  diagrams  in  the  course  of 
solving  a  problem  suggests  that  visual  processing  can  play  an  integral 
part  in  reasoning. 

The use of terrain maps - applied to navigation, for example
- provides a very general example of this phenomenon, in the sense
that maps pose most of the visual problems found in a range of other
schematic representations. Notice that navigation might involve not
only examining a map, but also sketching selected aspects of it, or
registering one map with another.

Terrain maps represent the physical elements of the landscape
and their spatial distribution. The descriptive elements of a map
include primarily plane curves and symbols. Curves are used to
represent both linear geographic features - such as rivers and roads -
and the boundaries of areal geographic features - such as landmasses
and political boundaries. Symbols - cartographic and alphanumeric
characters - are used as place markers for geographic features that
have little or no spatial extent at the scale of the map (like summits),
or they are used to label other items.

Even simple use of a map requires the following capabilities:

1. Distinguish and classify individual descriptive elements. For
example, a given line in the map must be interpreted as a river,
road, shoreline, etc. (This is a problem in segmentation/grouping
and recognition.)

2. Interpret combinations of descriptive elements as representations
of geographic items. For example, a set of closely spaced contour
lines that are nearly circular and concentric might represent a
volcano. (This, also, is a problem in grouping and recognition.)

3. Interpret references to locations in the map. There are several
different coordinate systems defined in a map in which such
references might be expressed. "North-west", "top-left", "P-16",
"Latitude 4, longitude 4", and "central Africa" are instances of
coordinates in different map-based reference frames, whose
interpretation depends on sophisticated spatial analysis.

4. Search for specified markers, descriptive elements, or combinations
of them. Every form of map use involves, as an initial step,
finding "it" on the map, where "it" can be an arrow, a name, or
a pattern of lines with specified properties - like a volcano. The
utilitarian nature of maps makes visual search a very prominent
spatial analysis operation. The rich context present in a map
makes the search problem non-trivial, especially when human
performance at such tasks is taken as a standard for efficiency.

5. Determine spatial properties of the items in the map, or relations
holding between them. The interesting properties and relations in
a map are those which have useful correlates in the geographical
world. Properties of and relations between landscape features
can map directly or indirectly to two-dimensional properties
and relations in the map. The length of a river is an instance of the
direct case; the elevation of a town must be inferred from the
nesting of the marker for the town within the contour lines.

Typical map use can be characterised in terms of two interacting
systems. Above, there is a reasoning system engaged in a task which
involves knowledge of the geographical world. At its disposal is a
partial, declarative model of the world, from which certain inferences
can be made. Below, there is a spatial analysis system whose input is
an image of a map, and which is capable of resolving spatial queries
given by the reasoning system, through the coordinated application
of the capabilities described above. The answers to these queries are
used to extend or correct the high-level geographical-world model.
Our eventual goal is to develop such a spatial analysis system.

The following mock "tour" of an imaginary map illustrates some
of the requirements on a spatial analysis system, in terms of a few
of the kinds of spatial properties and relations it must discriminate.
Such a sequence might be given to a map user on the other end of
a radio link.

1. Find the arrow in the upper left section of the map.

2. Find the town pointed at by the arrow.

3. Visually follow the road the town is on to the second river
crossing.

4. Visually follow the river to its source.

5. Find the peak of the mountain in which the river has its source.

6. Find Robot State Park, which is due north of this peak.

7. Find the point marked X inside the state park.

8. What is marked by the X?

9.2.  A  few  elemental  spatial  operations  go  a  long  way 

The visual routines paradigm, introduced by Ullman, proposes
that an open-ended set of abstract spatial properties and relations
can be computed by integrating basic spatial operations taken from
a small fixed set into specialized procedures. Being the basis of most
other spatial analysis, these elemental operations must be simple,
general, very robust, and very fast.

We have begun to study visual tasks in the context of schematic
drawings, with a view to the following issues:

1. What are the objects of spatial analysis, and what are the
important spatial relations and properties they exhibit?

2. What are robust, efficient visual routines for computing these
objects, properties, and relations; from what basic operations
are these routines composed?

The standards for efficiency and robustness come from human
performance at similar tasks.

3. How should the processing be controlled such that, in a real
visual problem solving situation, the appropriate routines are
applied productively?

4. What sort of architecture will effectively support the suggested
basic operations and control structure?

Ullman has made a number of suggestions for generally useful
basic operations, justified primarily on computational grounds. These
operations include contour tracking, area coloring, ray projection,
marking locations for later processing, shifting the processing focus,
and indexing to locations which have salient properties.


We have begun studies of visual tasks of the sort that arise in
maps and diagrams, in order to further define, evaluate, and complete
this partial set of basic operations and the processing model of which
they are a part. Initial efforts indicate that the operations listed above
can account, in large part, for a wide variety of the needed spatial
relations.

Current efforts are aimed at (a) determining what additional basic
operations are required and (b) designing an efficient implementation
for the basic operations that are being used.

9.3.  Pointers  are  useful  for  spatial  analysis. 

We have begun to consider the possible implementations
of spatial operations on the Connection Machine. The following
subsections contain initial proposals for the use of non-local
interconnections in the course of spatial analysis.

9.3.1.  Directing  Processing  To  Salient  Locations 

An important operation in spatial analysis is the intelligent
selection of locations for initial processing. Locations that are prominent
in some (selected) property or feature - such as brightness contrast,
relative motion, or the presence of line terminations, crossings,
or curvature changes - are often good candidates. We believe it is
important that vision hardware be able to test for prominence in
the property of interest at all locations simultaneously; then further
processing can be selectively applied to those locations at which the
test is successful.

So-called winner-take-all mechanisms have been proposed to
account for how the most prominent location might be chosen among
all of those having a given property (Koch & Ullman, 1984). One way
of accomplishing this with non-local communication is to determine
the location with the maximum value by an iterative procedure
much like an auction. When all but the most prominent location have
dropped out of the bidding, the processor representing this location
sends its address to the most-prominent-location cell preassigned for
the given property. Processing of this location may now proceed by
making use of the address this special cell contains (Koch & Ullman,
1984).
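
One way such an auction might proceed is sketched below in Python. The
rule used to raise the price (the mean of the remaining bids) is an
assumption made for the example, not a detail given above, and the loop is
a serial simulation of rounds that would run in parallel across
processors. The names winner_take_all and most_prominent are hypothetical.

```python
def winner_take_all(saliency):
    """Iterative auction over a dict mapping address -> prominence value.

    Each round the price rises to the mean of the remaining bids and every
    location bidding below the price drops out. Returns the winning
    address and the number of rounds taken.
    """
    active = set(saliency)
    rounds = 0
    while True:
        values = [saliency[a] for a in active]
        top = max(values)
        if min(values) == top:            # only (tied) top bidders remain
            break
        price = sum(values) / len(values)
        survivors = {a for a in active if saliency[a] >= price}
        # Guard against round-off leaving the field empty or unchanged.
        if not survivors or survivors == active:
            survivors = {a for a in active if saliency[a] == top}
        active = survivors
        rounds += 1
    return min(active), rounds            # ties go to the lowest address

# The preassigned cell that receives the winner's address.
most_prominent = {}
saliency = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.3}
most_prominent["brightness"], n_rounds = winner_take_all(saliency)
print(most_prominent, n_rounds)
```

Later processing reads the address stored in the cell instead of
searching the whole array again.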

9.3.2.  Concurrent  Computation  of  Spatial  Relations

On  machines  with  non-local  communication,  spatial  properties 
or  relations  may  be  computed  concurrently  in  a  single  application 
of  the  operation  if  the  relevant  scene  items  have  been  uniquely 
identified  first. 

For example, all the inside/outside relations occurring among
some set of boundaries and figures could be computed at once by
uniquely coloring the interiors of all boundaries (which colors the
figures they contain), and then collecting the addresses of all figures
of each color.
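
The coloring scheme can be sketched in Python as follows. This is a
serial stand-in for what would be a concurrent operation, and it makes
several assumptions for the sake of a small example: the scene is a
character grid, "#" marks boundary pixels, letters mark figures, the
boundaries are closed and not nested, and the top-left pixel is
background so that the exterior always receives color 0.

```python
from collections import deque

def color_regions(grid, boundary="#"):
    """Give every connected non-boundary region its own color (0, 1, 2, ...)."""
    rows, cols = len(grid), len(grid[0])
    color = {}
    next_color = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == boundary or (r, c) in color:
                continue
            color[(r, c)] = next_color
            queue = deque([(r, c)])
            while queue:                      # 4-connected region growing
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] != boundary
                            and (ny, nx) not in color):
                        color[(ny, nx)] = next_color
                        queue.append((ny, nx))
            next_color += 1
    return color

def figures_by_color(grid, color, background="."):
    """Collect (figure, address) pairs for each color."""
    groups = {}
    for (r, c), k in color.items():
        if grid[r][c] != background:
            groups.setdefault(k, []).append((grid[r][c], (r, c)))
    return groups

SCENE = [
    "............",
    ".####..####.",
    ".#a.#..#c.#.",
    ".####..####.",
    ".....b......",
]

groups = figures_by_color(SCENE, color_regions(SCENE))
print(groups)
```

Every figure that shares a color other than 0 lies inside the same
boundary, so one coloring pass answers all the inside/outside questions
for the scene at once.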

9.3.3.  Linking  Symbolic  and  Iconic  Visual  Representations 

As spatial analysis is incrementally applied to the early, more
iconic representations, more symbolic representations of portions (or
aspects) of the scene are produced. It is useful to link these symbolic
and iconic components, for each scene item, using pointers. For
example, through such pointers operations at the symbolic level can
initiate further image-level processing when necessary.

9.3.4.  Optimal  Algorithms  for  Elemental  Spatial  Operations

Coloring (or region growing) is one important class of elemental
operations. Using something like the Connection Machine's fixed
M connection network, coloring can be implemented with run
time proportional to the diameter of the region. On a serial architecture
the time is proportional to the region's area.

However this is only a lower bound on the performance we
seek: observations of human vision suggest that the corresponding
operations take roughly constant time - up to quite large diameters.
We believe that such performance will be necessary for full-scale,
real-time vision.


Achieving the required performance for these operations will
involve discovering novel parallel computation structures. Such
observations have architectural consequences for experimental systems:
using the Connection Machine's programmable interconnect will
provide an experimental medium for this study.

For example, coloring can be speeded up dramatically - in
terms of the number of algorithmic steps it takes - when the non-local
communication is used appropriately. This is for the same reason that a
figure can be painted more quickly with a large brush than a small one:
the analogue of brush size on machines with non-local communication
is the size of the pixel neighborhoods that are turned on at each
step. Again such observations have architectural consequences for
experimental systems: using the Connection Machine's programmable
interconnect will enable us to experiment with various brush sizes.
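
The brush analogy can be made concrete with a small step-counting
simulation in Python. It is a sketch under simplifying assumptions: the
brush is a square neighborhood of a given half-width, each loop iteration
stands for one parallel step, and the region is convex so that a wide
brush cannot jump across a boundary. The function name color_steps is
hypothetical.

```python
def color_steps(region, seed, brush):
    """Count the parallel steps needed to color `region` from `seed`.

    In each step every newly colored pixel turns on all region pixels
    within `brush` rows and columns of itself.
    """
    colored = {seed}
    frontier = {seed}
    steps = 0
    while frontier:
        new = set()
        for y, x in frontier:
            for dy in range(-brush, brush + 1):
                for dx in range(-brush, brush + 1):
                    p = (y + dy, x + dx)
                    if p in region and p not in colored:
                        new.add(p)
        if not new:
            break
        colored |= new
        frontier = new
        steps += 1
    return steps, colored

# A 9 x 33 rectangle colored from one corner.
RECT = {(y, x) for y in range(9) for x in range(33)}
small_steps, _ = color_steps(RECT, (0, 0), brush=1)
large_steps, _ = color_steps(RECT, (0, 0), brush=4)
print(small_steps, large_steps)
```

With a brush of half-width 1 the step count equals the extent of the
region measured from the seed; quadrupling the brush cuts the count by a
factor of four. For non-convex figures the brush would also have to
respect boundaries, which this sketch does not attempt.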

Systematic and extensive simulations can lead to an understanding
of how the optimal size and shape of the brush, in this example, are
constrained by particular kinds of input figures. (A tree might require
different strategies for fast coloring than the disc of the moon.)
The advantage of the Connection Machine as a simulation tool
comes not only from speed: programmable interconnections provide
a substantial conceptual advantage in the design and simulation of
novel parallel algorithms. Similar advantage has been experienced in
the use of high-level languages, or in the gradual adoption of what
were once strictly language-level concepts in the design of modern
conventional architectures. Naturally, the benefits will be even greater
as effective programming-language concepts are developed.


10.  Conclusions 


In this report we have reviewed some of our work over the last
year on Vision and Image Understanding. Our effort spans various
levels. It begins with the early problems of computing surface distances
and other surface properties (the 2½-D sketch). We have analyzed
in depth the problem of edge detection, multiple scales, stereo and
motion. We have implemented algorithms for solving these problems
efficiently. We now plan to refine our analysis of the individual
modules of early vision, consider new ones and attack the problem of
their fusion for a reliable and robust computation of the 2½-D sketch.
At a higher level we are studying the problem of representing shape
and recognizing objects from different, complementary perspectives.
At a still higher level we have described in some detail the idea of
visual routines, as a key area of investigation for solving efficiently
and reliably many high level visual tasks, including navigation and
recognition.

The new idea of regularization analysis promises to unify
at least part of our research in early vision. It also suggests
a preliminary classification of parallel architectures for vision.
Regularization algorithms of early vision typically process retinotopic
arrays of data with only local connections. Non-regularization
algorithms, such as for instance some of the visual routines, are
best implemented with hardware facilities capable of manipulating
pointers and setting up virtual connections between spatially
non-adjacent processors. The Connection Machine, being developed at the
Artificial Intelligence Laboratory, promises to become soon a fertile
ground for experimenting with efficient parallel implementations of
our vision algorithms.

Abstract

We  describe  research  in  intelligent  systems  for  Image 
Understanding.  The  ACRONYM  system  has  been  used 
in  recognition  of  industrial  parts  in  the  Intelligent  Task 
Automation  project.  A  system  for  intelligent  matching 
in  stereo  and  motion  sequences  is  under  development.  A 
geometric  modeling  system  using  stereo  images  has  been 
implemented.  A  sophisticated  symbolic  graphics  system 
for  a  very  general  class  of  generalized  cylinders  is  nearing 
completion. A preliminary report is presented of reformulation
of problems in computational geometry. Continuing
analysis  of  the  interpretation  of  line  drawings  is  described. 
New  results  have  been  obtained  in  aggregation  and  in  edge 
segmentation.  A  survey  is  presented  of  commercial  array 
processors. 

I.  Introduction 

This  report  describes  research  in  intelligent  systems  for 
Image  Understanding  which  includes  geometric  modeling, 
geometric reasoning, analysis of image structure, and interpretation
of three dimensional structure. One goal of
the  research  is  an  intelligent  stereo  mapping  system. 

[Chelberg 81] describes using the previous intelligent
system, ACRONYM [Brooks 81], for recognition of industrial
parts in the Intelligent Task Automation project. Several
extensions were necessary in geometric modeling and
geometric reasoning, extensions which carry over to the
SUCCESSOR system. The experience in this project has
sharpened requirements for the new system.

A system for obtaining correspondence in stereo and
motion sequences is near the stage of matching image
sequences [Dreschler-Fischer til]. The system uses curves
and corners as image features; however, it performs structural
matching. It groups features within single images
into constellations and generates plans for data-driven
matching.

A geometric modeling system has been built which
enables construction of generalized cylinder models and
which determines a class structure of models [Takaumra
til]. The user interacts with the system by menu selection
using a voice input device, using the keyboard only for
naming parts. Models of objects are built up from stereo
pictures of objects by using a pointing device with a stereo
station. The system minimizes the number of input points
required for defining generalized cylinder parts.

A system is nearly completed for symbolic graphic
output for a very large class of generalized cylinders [Scott
84]. Generalized cylinders are specified by spine, cross
section, and sweeping rule, each of which can be a general
function of one variable for this system. The system
is novel in the large class of objects which can be
displayed. The hidden surface algorithm is related to ray
tracing which is the method of choice in high performance
graphics. Although ray tracing is a brute force procedure
which consumes great computational resources, this
research includes an analysis of computational complexity
and has resulted in conceptual improvements to limit
complexity. This research has led to some concepts for
generating generic, symbolic predictions of possible views
of objects.

Research into choosing representations for solving
problems in geometric reasoning led into specializing in
reformulating problems in computational geometry [Lowry
84]. Computational geometry algorithms have special
relevance in geometric reasoning and geometric database
operations. This research deals with problem-solving in
computational geometry with computational complexity
considerations foremost, related to strategy selection for
efficient algorithms in perceptual reasoning. In other research in
geometric reasoning, Chelberg (unpublished) made a system
for symbolic manipulation for problem simplification
by approximation of algebraic constraints.

[Malik 84] describes inference of surfaces from line
drawings. Inference rules and geometric reasoning are
described together with an enlightening example. The
approach is particularly relevant because previous analyses
had ambiguities of 100 to 200 for the simplest drawings,
that of a tetrahedron.

Research continues on aggregation and segmentation.
[Lowe 84] describes a program which groups curve segments
which form continuous curves. [Nalwa 84] describes
a directional edge operator. [Triendl 84] describes an edge
finding system which determines extended curves with
circular arc and straight segments. Dreschler has
implemented her corner finder.

An evaluation of short term improvements to computation
power for IU has been carried out in the form of a
survey of commercially available array processors [Lim 84].

[Blicher 84] notes that dimensional arguments dictate
that edges cannot be localized simultaneously in angle and
transverse position by zero crossings of a single convolution
operator.

A mobile robot is on the verge of being operational
without sensing. It has an onboard 68000 and LSI-11,
digital communication at 1200 baud, an analog TV transmitter,
and Polaroid acoustic sensors.


II  Systems 

The Intelligent Task Automation project dealt with location
and assembly of a tray of fifteen parts in an uncontrolled
environment [Chelberg 84]. Occlusion and moderate
leaning of parts on one another were permitted. This
was a new class of parts for ACRONYM. All parts were
modeled. To do so required extending the modeling system
to include holes or negative volumes and helices, and to
represent stable states. ACRONYM was augmented to include
stable states in reasoning; to reason about holes; to
predict new relations among primitives, specifically
concentric relations between concentric cylinders, translated into
predictions about concentric ellipses; to predict connected
relations between ellipses and ribbons; and to predict parallel
relations between coils of a spring and enclosed relations
between coils of a spring and an enclosing boundary of the
spring. While some code to predict ellipses had been in the
original ACRONYM code, ellipse prediction and matching
were not fully implemented.

Of  the  fifteen  parts,  ten  have  been  recognized  in  any 
stable  state.  Predictions  were  generated  automatically 
for  the  remaining  five  parts  but  matching  has  not  been 
achieved  for  them. 

[Dreschler-Fischer 84] describes a system for correspondence
in stereo pairs and motion sequences. Constraints
in correspondence are maintained as separate knowledge
sources which can change dynamically according to knowledge
acquired in operation. Corners and curves are the
features used in the system. However, it analyzes structures
or constellations of these features in each image and
plans a matching sequence which seems effective for the
image. It classifies corners formed by curves, especially to
relax constraints on T-junctions which indicate occlusion
and for which no correspondence is expected. The system
groups features into similarity classes which might be
ambiguous under the local matching operation. For example,
for a checker board, interior corners are all similar, while
the four corners of the checker board are unique. The system
begins by matching them. Corners connected to them
are then unique. Matching takes place between classes of
features, rather than individual features. Matching has
been tested only on artificial data.
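
The checker board example can be made concrete with a small sketch. It is
only an illustration under stated assumptions, not the system of
[Dreschler-Fischer 84]: the descriptor (the set of grid directions in which
a corner has a neighbor) and the function names are hypothetical, chosen so
that the four outer corners fall into classes of size one and are matched
first.

```python
from collections import defaultdict

# Hypothetical local descriptor: which of the four grid directions
# lead to another corner point. Real descriptors would come from the image.
STEPS = {"E": (1, 0), "W": (-1, 0), "N": (0, -1), "S": (0, 1)}

def neighbour_directions(point, points):
    x, y = point
    return tuple(sorted(d for d, (dx, dy) in STEPS.items()
                        if (x + dx, y + dy) in points))

def similarity_classes(points):
    """Group points whose local descriptors are indistinguishable."""
    classes = defaultdict(list)
    for p in sorted(points):
        classes[neighbour_directions(p, points)].append(p)
    return dict(classes)

def matching_order(points):
    """Least ambiguous classes (fewest members) are matched first."""
    return sorted(similarity_classes(points).values(), key=len)

# A 4 x 4 grid of corner points standing in for a small checker board.
board = {(x, y) for x in range(4) for y in range(4)}
order = matching_order(board)
unique = sorted(c[0] for c in order if len(c) == 1)
```

In this toy setting the unique classes are exactly the four outer corners,
and the interior corners form the single largest class, which is matched
last.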

III Geometric Modeling

The geometric modeling system [Takamura 84] has a user
interface which relies on commands invoked from menus
and selected with voice input. The keyboard is used for
naming elements. Geometric forms are specified by points
which are entered by pointing devices, in this case a trackball.
Points are three-dimensional points determined from
stereo pairs of pictures. Generalized cylinders are specified
by cross section, spine, and sweeping rule. Cross sections
can be specified by a few points, e.g. three points for a
circle or rectangle. For the class of generalized cylinders
with straight spine and constant sweeping rule, only two
cross sections are required. For some simple cross sections,
a total of only four points is required for an object.
For complex parts, still relatively few points are required,
of order ten points. Parts can be defined from others by
symmetry. A model of a simple object with a few parts can
be built in 10 minutes. The system writes out a textual
model of the part. It also determines object classes by
generalization of constraints which determine object classes in
ACRONYM.
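
As an illustration of why three points suffice for a circular cross
section, the following sketch recovers a circle from three points. It
assumes the points have already been expressed as 2-D coordinates in the
plane of the cross section; the function name is hypothetical and the
modeling system's actual procedure is not described in this report.

```python
import math

def circle_from_three_points(p1, p2, p3):
    """Return (center, radius) of the circle through three 2-D points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Twice the signed area of the triangle; zero means collinear points.
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear; no unique circle")
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return (cx, cy), math.hypot(x1 - cx, y1 - cy)

center, radius = circle_from_three_points((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0))
```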

A symbolic display system generates a projection of
visible surfaces of a very general subclass of generalized
cylinders [Scott 84]. Conceptually it can be thought of as
ray tracing, i.e. projecting back rays from each pixel to
intersect objects, ordering intersections of surfaces along
each ray by distance from the image. However, it is wasteful
to order surfaces along each ray, since order relations
change only along boundaries, a one-dimensional subset of
rays. In fact, depth relations change only at T vertices and
cusps, a zero-dimensional subset of rays. This is a striking
decrease in computation in depth ordering.

Limbs of generalized cylinders are obtained by an iterative
search stepwise along the surface. Surfaces are specified
by spine, cross section, and sweeping rule, each of
which may be an arbitrary function of one parameter which
is evaluated at each step. Steps are chosen according to a
uniform quality criterion. The step distance was chosen
to give a constant number of iterations per step. About
100 steps were required for a three turn helix. Objects
are defined by unions, intersections, and negatives of these
primitive volumes or surfaces.
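
The representation by spine, cross section, and sweeping rule can be
sketched as three functions of one parameter each. The code below is a
minimal illustration, not the display system of [Scott 84]: the function
names, the uniform step size, and the simple local frame (which is not
twist-free along a curved spine) are all assumptions made for the example.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def local_frame(spine, t, h=1e-4):
    """Two unit vectors spanning the plane normal to the spine at t."""
    tangent = _unit(_sub(spine(t + h), spine(t - h)))
    # Pick a reference direction that is not nearly parallel to the tangent.
    ref = (0.0, 0.0, 1.0) if abs(tangent[2]) < 0.9 else (1.0, 0.0, 0.0)
    e1 = _unit(_cross(ref, tangent))
    e2 = _cross(tangent, e1)
    return e1, e2

def sample_surface(spine, cross_section, sweep, n_steps=10, n_around=8):
    """Rings of 3-D surface points.

    spine(t) -> (x, y, z), cross_section(u) -> (a, b) in the local plane,
    sweep(t) -> scale factor, with t in [0, 1] and u in [0, 2*pi).
    """
    rings = []
    for i in range(n_steps + 1):
        t = i / n_steps
        origin, scale = spine(t), sweep(t)
        e1, e2 = local_frame(spine, t)
        ring = []
        for j in range(n_around):
            a, b = cross_section(2.0 * math.pi * j / n_around)
            ring.append(tuple(origin[k] + scale * (a * e1[k] + b * e2[k])
                              for k in range(3)))
        rings.append(ring)
    return rings

# A truncated cone: straight spine, circular cross section, linear sweep.
cone = sample_surface(lambda t: (0.0, 0.0, t),
                      lambda u: (math.cos(u), math.sin(u)),
                      lambda t: 1.0 + t)
```

A helix would be obtained by changing only the spine function, which is the
appeal of keeping the three functions independent.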

Projections of limbs are put into an image quad-tree
from which surface ordering is obtained. T-junctions are
found here. Because the determination of limbs is an
approximate stepwise procedure, some degradation can occur
from lack of resolution. An algorithm has not yet been worked
out to obtain consistent, realizable approximations, but an
outline is given for a solution.

The system has generated hidden surface views of
complex parts, but the full system for structured objects
is still being implemented.

From this research has come the problem of determining
the locus of points at which the qualitative structure of
projections changes. This may lead to compact prediction
of image appearance for complex objects.

IV  Geometric  Reasoning 

[Lowry 84] describes considerations for design of a system
for automatic reformulation of algorithms in computational
geometry. An implementation is underway of a
system to solve some problems in computational geometry
with computationally efficient solutions. The system
is intended to use general methods which are not tailored
to the particular problems to be solved. Two approaches
are distinguished. The first approach seeks to transform
the problem into the form of a computational schema like
divide and conquer. The second approach seeks to discover
properties of the problem which can be exploited
by problem-solving methods. The first is called
schema-driven, the second constraint-driven.

[Malik 84] describes extensions of the analysis of
[Binford 81] for interpretation of surfaces from line drawings.
Implementation has begun of a program to interpret complex
line drawings. The initial analysis is primarily aimed
at drawings with straight lines. The notion of minimum
number of surfaces was introduced, leading to analysis and
constraints on invisible surfaces. Coplanarity rules have
been introduced to identify surfaces coincident with lines.
A particularly interesting example is used to demonstrate
these constraints.

The limitations of previous analyses were most striking
in that they tried to find all solutions. [Draper 80]
shows that there are many solutions for the simplest
drawings. The current analysis promises to reduce this
ambiguity extensively.

V  Segmentation 

[Nalwa 84] describes a directional operator for determining
edge elements over a small disk. It has subpixel position
resolution and 5 degree angle resolution for a step-to-noise
ratio of 2. The operator has a one-dimensional function oriented
at an angle; its cross-section function is tanh. First, it
determines the orientation of the intensity surface in the
window by fitting a plane. Then it fits a cubic surface, then
a tanh surface, and a quadratic surface for comparison,
to determine presence or absence of an edge.
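
The first step, estimating orientation by fitting a plane, can be sketched
as follows. This is a simplified illustration rather than the operator of
[Nalwa 84]: it uses a square window with odd side instead of a disk, and
the function name is hypothetical. With coordinates centered on the window,
the least-squares plane I = a x + b y + c has a closed form because the
sums of x, y, and xy vanish.

```python
import math

def plane_fit_orientation(window):
    """Least-squares plane fit over a centered square window.

    window is a list of rows of intensities with odd side length.
    Returns (a, b, angle_deg), where (a, b) is the fitted gradient and
    angle_deg is its direction; an edge would run perpendicular to it.
    """
    n = len(window)
    r = n // 2
    sx = sy = sxx = 0.0
    for row in range(n):
        for col in range(n):
            x, y = col - r, row - r
            sx += x * window[row][col]
            sy += y * window[row][col]
            sxx += x * x          # equals the sum of y*y on a square window
    a, b = sx / sxx, sy / sxx
    return a, b, math.degrees(math.atan2(b, a))

# Intensity ramp increasing along x only.
ramp = [[2 * (c - 2) + 5 for c in range(5)] for _ in range(5)]
a, b, angle = plane_fit_orientation(ramp)
```

The later cubic, tanh, and quadratic fits would then be one-dimensional
fits along the estimated direction; they are not sketched here.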
1. Robust Vision Operators

1.1.  Parameter  Networks  and  the  Hough  transform 

One of the most difficult problems in vision is
segmentation. Recent work has shown how to calculate
intrinsic images (e.g., optical flow, surface orientation,
occluding contour, and disparity). These images are
distinctly easier to segment than the original intensity
images. Such techniques can be greatly improved by
incorporating Hough methods. The Hough transform idea
has been developed into a general control technique.
Intrinsic image points are mapped (many to one) into
'parameter networks' [Ballard, 1983]. This theory explains
segmentation in terms of highly parallel cooperative
computation among intrinsic images and a set of
parameter spaces at different levels of abstraction.
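
The many-to-one mapping from image points into a parameter space can be
illustrated with the classical Hough transform for straight lines. This is
a textbook sketch, not the parameter-network formulation of [Ballard,
1983]; the bin sizes and function name are assumptions made for the
example.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes in (theta index, rho index) space.

    Each point (x, y) votes for every line through it, parameterized as
    rho = x*cos(theta) + y*sin(theta).
    """
    votes = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(k, round(rho / rho_step))] += 1
    return votes

# Ten points on the horizontal line y = 5.
votes = hough_lines([(x, 5) for x in range(10)])
peak = max(votes.values())
```

All ten points agree on the bin for theta = 90 degrees and rho = 5, so that
bin receives the maximum possible vote. Neighboring bins may tie with it at
this coarse resolution, which is one reason peak sharpening is of interest.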

The most recent applications of these ideas are to
improved shape-from-shading calculations which work on
several spaces [Brown et al., 1983] and motion extraction
[Ballard & Kimball, 1983]. This domain-specific effort is
closely linked to our new work on a more general theory
of Hough-like computations and general implementation
techniques for them.

The theory is also useful in analysis of cache-based
Hough Transform implementations. It is an appealing idea
to use a small content-addressable store to accumulate
Hough transform results, rather than a potentially huge
multi-dimensional array. The initial technical issues were
discussed in [Brown & Sher, 1982]. More recent
developments are presented in [Brown, 1983; Brown,
1984]. We are currently pursuing VLSI implementations.

1.2  Hough  Transform  Implementation 

Earlier work on the Hough transform [Brown, 1983;
Brown & Sher, 1982] has led in three directions.

1) Research toward a theory of cache accumulator
arrays [Loui, 1983; Brown & Feldman, 1983]

2) Experiments with complementary HT and
cache management strategies [Brown et al., 1983]

3) Hardware (VLSI) designs for HT vote caches
[Sher & Tevanian, 1983].


Work in each of these directions is in progress; some
of the cited references are draft documents. The behavior
of caching schemes for accumulation of votes in the
Hough transform is equivalent to the statistical problem of
estimating the mode of a distribution using only a finite
memory for vote tallies, and is a generalization of the
familiar secretary (maximum of a sequence, beauty
contest) problem. Loui's document explores this avenue
for analysis. The experiments with HT implementation are
to see how well the peak-sharpening provided by
complementary HT performs with real images on complex
shapes. Work on cache architectures (hierarchical schemes,
cascaded caches) is ongoing.
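
To make the finite-memory mode estimation problem concrete, the sketch
below tallies a stream of votes in a cache of fixed capacity. The
replacement policy is a Misra-Gries style frequent-items summary, swapped
in here as a stand-in; it is not the scheme analyzed in the cited drafts,
and the function name is hypothetical.

```python
def cached_mode(vote_stream, capacity):
    """Estimate the most frequent key using at most `capacity` tallies."""
    cache = {}
    for key in vote_stream:
        if key in cache:
            cache[key] += 1
        elif len(cache) < capacity:
            cache[key] = 1
        else:
            # Cache full and key absent: decrement every tally and evict
            # any that reach zero. The new vote itself is discarded.
            for k in list(cache):
                cache[k] -= 1
                if cache[k] == 0:
                    del cache[k]
    return max(cache, key=cache.get) if cache else None

stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a"]
estimate = cached_mode(stream, capacity=2)
```

With this policy, a key that receives more than n/(capacity + 1) of n votes
is guaranteed to remain in the cache, although the surviving tallies
underestimate true counts, so the largest tally is only a heuristic choice
of mode.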

The VLSI design project produced a circuit for vote
caching that can be cascaded to provide a cache of any
length. Work on improving the efficiency and power of
the design will continue this summer.

1.3  High  Level  Planning 

In  general,  problem  solvers  cannot  hope  to  create  plans 
that  are  able  to  specify  fully  all  the  details  of  operation 
beforehand  and  must  depend  on  run-time  modification  of 
the plan to ensure correct functioning. The run-time
planning  idea  becomes  particularly  important  when 
different  plan  segments  are  being  explored  concurrently. 
These  communicating  segments  may  require  sophisticated 
actions  e.g.  (do  PLA\  until  Pl.A\).  These  issues  are 
being  studied  by  [Russell]  in  the  context  of  a  cooperative 
planning  and  execution  system  for  manipulation  tasks.  A 
recent effort [Ballard, 1984] is examining robot planning
from  a  task  frame  perspective. 

2.  Computing  with  Connections 

We are continuing our interest in problem-scale
parallelism, both as a model of animal brains and as a
paradigm for VLSI [Feldman et al., 1984]. Work at
Rochester has concentrated on connectionist models and
their application to vision. The framework is built around
computational modules, the simplest of which are termed
p-units. We have developed their properties and shown
how they can be applied to a variety of problems
[Feldman & Ballard, 1982]. More recently, we have
established powerful techniques for adaptation and change
in these networks [Feldman, 1982].


A major milestone was achieved with Sabbah's thesis
on massively parallel recognition of Origami-world objects
[Sabbah, 1982]. Sabbah's work extended the connections
methodology to a problem domain with several
hierarchical structural levels. The resulting program is, to
our knowledge, the most noise-resistant system for dealing
with this level of complexity. One outcome of Sabbah's
effort has been a project to build a general purpose
simulator for massively parallel systems [Small et al.,
1982].

The general connections simulator has been well
tested and is being used in a number of applications. One
project involves a quite detailed simulation of motor
control networks of the oculomotor system [Addanki,
1983]. Another application is to a spreading activation
model of word sense disambiguation and related problems
in natural language understanding [Cottrell & Small,
1983]. A major new effort involves modelling conceptual
knowledge (such as that needed for high level vision) in
connections terms [Feldman & Shastri, 1984; Shastri &
Feldman, 1984].

A  new  effort  has  been  the  development  of  a  much 
faster  C  version  of  the  simulator  and  the  exploration  of  its 
use  on  a  highly  parallel  machine.  We  have  received  major 
funding  for  a  new  highly  parallel  computer  which  will  be 
available for work in IU tasks.

For a VLSI design course, a circuit was designed to
implement key aspects of the "connections"
computational paradigm [Rainero & Kautz, 1983]. This
cited document is a course project report, and the exercise
was mainly useful in isolating particular technical
problems that must be addressed in any such parallel,
activation-passing computer.

3.  Motion 

Our interest in motion has centered around methods
for extracting rigid body parameters from optic flow and
intensity images. These parameters are extremely useful in
navigation and target tracking. Currently these nine
parameters (origin, translational velocity, rotational
velocity) can be extracted from flow via a Hough
technique [Ballard & Kimball, 1983]. A more recent model
exploits multiple channels [Bandyopadhyay, 1984]. We are
also pursuing the use of these parameters to speed up the
flow computations themselves [Stuth et al., 1983]. A major
current effort relates optical flow information to surface
orientation [Aloimonos & Brown, 1984] and sensor motion
[Aloimonos & Brown, 1984].

4.  Shape 

The  description  and  recognition  of  complex  shapes 
continues  to  be  a  major  focus  of  the  project.  The  analysis 
of  the  dot  product  space  representation  has  been  improved 
to  handle  certain  pathological  cases,  and  has  been 
generalized  to  accommodate  different  criteria  for  the 
goodness of the representation.

This simple concept of shape has been applied to the
problem of reconstructing three-dimensional surfaces from
very sparse data. The key idea is to use appropriate shape
descriptors to hypothesize a transformation which accounts
for the difference in shape between successive contours.
When the hypothesized transformation is minor, very
simple-minded surface reconstruction techniques are
sufficient. When there are major differences in shape or
position between successive contours, our method
hallucinates new contours, using the hypothesized shape
transformation [Sloan & Hrechanyk, 1981]. A major new
effort is the extraction and use of symmetries in images
[Freidberg & Brown, this Proceedings].

Hierarchical descriptions of shapes were considered in
[Ballard & Sabbah, 1981] in a preliminary fashion. Our
previously reported shape model [Hrechanyk & Ballard,
1982] concentrated on problems of view-invariance and
attention shifting within a single prototype. This model has
been extended to handle the problems of extracting
primitive shape descriptions from noisy images. Our work
was motivated by dissatisfactions with smoothness criteria
for intrinsic image computations. Recent work extends
these ideas to simple 3-D shapes [Ballard et al., 1984].

The  practicality  of  shape  from  shading  computations 
and  their  interaction  with  the  determination  of  other 
image  parameters  (such  as  illuminant  position)  was 
addressed by two papers in the Fall 1982 DARPA Image
Understanding Workshop. We are now applying the
algorithm to real images, and want to investigate scenes
with non-Lambertian reflectance functions that are
unknown a priori. We want to explain how humans in fact
use  shading  to  derive  shape,  given  the  complexity  of 
reflectance  functions  and  imaging  situations  in  the  world. 
Two  competing  theories  are  that  somehow  the  reflectance 
functions  are  derived  fairly  accurately  by  an  adaptive 
procedure,  or  instead  that  we  only  support  a  small 
number  of  reflectance  functions  that  are  selected  by  other 
cues  (such  as  gloss). 

5. General Theory of Vision

Work  in  our  laboratory,  among  others,  has 
demonstrated  strong  links  between  powerful  IU 
techniques  and  computations  used  by  animal  visual 
systems.  We  have  established  strong  ties  with  a  wide  range 
of  visual  scientists  at  Rochester  and  a  variety  of 
collaborative  efforts  are  underway.  One  early  project  is  to 
survey  the  computational  similarities  in  natural  and 
computer vision [Ballard & Coleman, 1983].

We  have  begun  to  exploit  Rochester  neurobiology 
expertise  in  order  to  hone  and  improve  our  connectionist 
modelling  efforts.  One  difficult  avenue  is  to  specify  the 
interface  between  our  computational  models  and  the  state- 
of-the-art  neurobiological  picture.  Our  efforts  in  this 
direction are summarized in [Ballard & Coleman, 1983]
and the collaboration is continuing. Another effort is our
attempt to develop a general framework for theories of
vision that would provide a common structure for
integrating studies from various disciplines [Feldman,
1982].
1. ABSTRACT

Our research is divided into three major areas: 3-D
vision, mapping from aerial images, and parallel processing.
In 3-D vision, we are exploring the use of curvature
properties for describing surfaces, and techniques of
computing generalized cones from sparse range data. The
work on mapping concentrates on stereo analysis and an
"expert" module for extracting runways in a major airport
complex. The parallel processing work has consisted of
the design of a new cellular architecture and mapping of
selected algorithms on it.

2.  INTRODUCTION 

Our  research  in  the  recent  period  is  divided  into 
three  major  areas: 

a)  3-D  "Robotics"  Vision 

b)  Mapping  from  Aerial  Images 

c)  Parallel  Processing  of  IU  Algorithms 

Our activities in each of the areas are summarized
below. Other papers in these proceedings from our group
provide more details of some of the specific research, in
which case only a brief note is provided below. This
paper describes the work of many individuals in our group.
The names of the people contributing to each specific
section are given in the text.

3.  3-D  VISION 

The aim of this research is the description of 3-D
objects and surfaces, for the purposes of recognition,
position and orientation determination, and inspection.
3-D representations can describe either the surface or the
volume of an object. Volume descriptions have strong
advantages over surface descriptions in most cases.
However, some smooth surfaces, e.g. a turbine blade, are
"featureless" and possess little volume and may be better
described as surfaces. We are pursuing both kinds of
descriptions.
3.1. 3-D Surface Representation (Medioni and Nevatia)

A common method for representing surfaces is by
piecewise approximation with some basis surfaces. Such
methods do not give "natural" descriptions, and the places
where the piecewise descriptions meet may have no clear
physical properties. We suggest that even featureless
surfaces have some distinguishing points or lines on them
and that these "features" are appropriate for matching and
for generating higher level descriptions. We propose that
high curvature points and curves joining such contiguous
points are one class of such features. Such points are on
the occluding contours, surface discontinuities, and on the
"ridges" and "valleys" of the surfaces.

This approach and some results are described in a
separate paper in these proceedings [1].

3.2. Generalized Cone Descriptions (Nevatia, Kashipati
and Medioni)

Generalized cones have come to be increasingly
recognized as a powerful representation scheme for 3-D
volumes. However, the number of implementations that
compute them from scene data is rather small, and they
have severe limitations. Typically, they assume perfect
3-D data and/or knowledge of specific objects in the scene.
We have initiated work at computing such descriptions
from sparse 3-D data as might be generated by an edge-line
based stereo system (prior to interpolation). Currently,
our method applies only to a restricted class of
generalized cones and is not fully tested, but we expect it
to generalize to a much broader class of objects.

The boundaries (or edges) in any scene can be considered
to be in the following classes, following the traditional
classification introduced by Huffman and Clowes
[Ch. 4 refs 48.5] (see Figure 1 for an example):

1. Occluding boundaries: This is a boundary where the
visible surface is all to one side of the boundary
(shown by "1" in Figure 1). Previous contour
analysis, of Nevatia and Binford, and of Marr,
essentially assumes that these are the only visible
contours.

2. Surface slope discontinuities: These may be caused
by slope discontinuities in the cross-section shape,
such as along the corners of the cross-sections of a
polyhedral object, or by a "terminator" (i.e. the end) of
a generalized cone (shown by "2" in Figure 1). These
kinds of boundaries would cause difficulties in the
earlier methods of analysis, but we will show how
they may provide very valuable pieces of information.

3. Surface markings: These are caused by changes in
the surface reflectance rather than the surface position
or slope ("3" in Figure 1). In simplistic analysis,
these may be confused with occluding boundaries.

4. Others: Other sources are due to noise, shadows,
highlights, etc. Our approach does not deal with
them explicitly, but should work in their presence
(we can essentially consider them to be the same as
surface markings for our analysis).

For generalized cones, the important boundaries could be
alternatively classified as those produced from "contour
generators" and terminators. Intuitively, contour
generators are the extremal points on the surface which
enclose the visible surface (and are thus view-point
dependent). For a smooth generalized cone, the contour
generators are the points on the surface where the line of
sight is tangential to the viewed surface. More generally,
the contour generators are the loci of the extremal points
on the cross-sections. Note that contour generators are a
subset of the occluding boundaries. (In this, our definition
is slightly different from that used in Shafer [2], but has
the same intuitive notion. Also, we will not differentiate
here between "contours", which are projections in the image
of contour generators, and the contour generators themselves,
as we use 3-D boundaries.)

The terminators of a generalized cone are simply its
ends (imagine an infinite generalized cone that has been
cut at two ends). Note that the cut, and hence a terminator,
need not be planar, and when planar, it need not
be normal to the axis. Thus, we prefer to describe a right,
circular cone with a slanted cut, as a straight,
homogeneous generalized cone with an oblique termination,
rather than as an homogeneous, generalized cone
with cross-section changing shape at the end, i.e. our
descriptions are necessarily in terms of right generalized
cones. Note that the terminator may share part of the
occluding boundary. The terminators have been a source of
difficulty in analysis of boundaries, as in [3]; however, they
can also provide valuable clues to the shapes of the
cross-sections.

Now, the scene description problem may be considered
to be that of isolating the contour generators and
terminators of the generalized cones present in a scene.
The axis of the generalized cone is the axis of symmetry
of the contour generators, and the terminators give
cross-section shape under certain conditions.

The key to our approach consists of the following
observations about boundaries of generalized cones:

1. A contour generator is tangential (in 3-D) to the
terminator boundaries (3-D tangency also implies a
2-D tangency). Further, the contour generator must be
to one side of the plane containing the local contour
tangent and the viewing point. (In 2-D, the terminator
boundary must be all on one side of the
contour at the junction.)

2. For a linear, straight, homogeneous generalized cone
(LSHGC), the contour generator is planar from any
view (established by Shafer [2]). For a non-linear
SHGC, the contour generators are planar in a side
view, but not necessarily in an oblique view. If we
view a non-straight, non-homogeneous generalized
cone as consisting of piecewise LSHGCs, then we
can expect the contour generators to be "piecewise
coplanar". In our current implementation, we have
tested only the LSHGCs, but believe that we can
extend to elongated generalized cones by using the
piecewise approximations.

The  method,  then,  consists  of  computing  descrip¬ 
tions  from  a  given  set  of  segments  that  follows  the  above 
constraints.  In  our  current  implementation,  we  find  all 
possible  pairs  of  contour  generators  (i.e.  all  coplanar 
pairs),  then  compute  the  corresponding  terminators  and 
then  evaluate  the  descriptions.  Clearly,  for  a  complex 
scene,  it  will  not  be  feasible  to  examine  all  alternatives, 
and some choices will need to be based on partial
analysis alone. Tracing of terminator boundaries is
relatively straightforward when there are no gaps; otherwise
only fragments of terminators will be obtained.

The  selection  of  descriptions  is  based  on  the
following:

1.  Longer  contour  generators  are  preferred  over  shorter
ones  (i.e.  we  prefer  descriptions  with  longer  axes),

2.  Parallel  contour  generators  are  preferred  (i.e. 
cylinders  are  preferred  over  cones), 

3.  Descriptions  with  closed  and/or  planar  terminators 
are  preferred. 

With  this  simple  implementation,  we  are  able  to  find 
good  descriptions  for  simple  objects  such  as  cylinders  and 
cones.  We  have  analyzed  scenes  with  some  occlusion  and 
surface  markings,  but  with  no  missing  segments.

4.  MAPPING  FROM  AERIAL  IMAGES 

This is part of our continuing research from previous
years. In the past, we have developed methods for linear
feature extraction and region segmentation, image to map
correspondence, stereo analysis, and shadow analysis.
These have been reported in previous DARPA Image
Understanding workshops, our internal reports and other
open literature. Our current work focuses on two
projects: stereo analysis and development of an "expert"
module for airport analysis.

4.1.  Stereo  Analysis  (Mohan,  Medioni  and  Nevatia) 

Stereo is relied upon heavily by human analysts in
mapping from aerial images. In many cases, stereo may
be the only direct cue for obtaining 3-D data in aerial
images (direct ranging is often not practical). In previous
work, we reported a stereo algorithm that matches the line
segments detected in the two images [4]. We feel that,
being an edge-based scheme, it has important advantages
over conventional intensity-correlation-based stereo,
and also over methods that match unconnected edges.
Our recent work has been in detailed testing and evaluation
of this algorithm on a variety of scenes. One of the
complex images we have been working with is the
"Pentagon" image shown in Figure 2 (resolution 512 x 512);
this is a good test image due to the presence of parallel
lines and fine detail, and there is some comparative data
available from other work [5]. Our tests have indicated
several minor and major areas for further improvement:

1. We found that imposing a constraint that the contrast
of the matching segments be similar caused
our system to lose many good matches. We have
eliminated this constraint but still use the sign of the
contrast. This also indicates that correlation stereo
would have difficulties in these instances.

2. Order preservation - our original algorithm did not
explicitly require any order to be preserved among
the matching segments. We have incorporated an
order constraint and tested its effects. Our algorithm
works with segments rather than edges, and
no complete order can be defined on segments in 2-D.
We define and use a pairwise order instead - the
order of two segments is defined by their relative
positions in the parts where they overlap in the
direction of the epipolar lines. This order is well
defined and unambiguous, as segments may not
intersect. The order is used in matching in the
following way: evaluation of a matching pair (i,j)
depends on how well the disparity of (i,j) agrees with
the disparities of other matching segment pairs (k,l)
in the neighborhood of i and j; if the order of (k,l) is
not the same as that of (i,j), then (k,l) contributes a
large penalty weight to the evaluation of (i,j). (We
are omitting the details here as the evaluation function
is rather complex; it is described in detail in [4].)
Our experience with the order constraint indicates
that the use of order removes some incorrect
matches, and finds matches for some segments
left unmatched previously. However, the improvements
seem to be rather minor, at least for the examples
we have tested. Perhaps this is an indication
of the robustness of the original, unordered
algorithm itself.

3. The third area of improvement has been in the
implementation of a hierarchical matching method. As
the resolution of the image increases, so does the
number of segments detected in it. In our case,
since we use 2-D windows to establish context, the
number of segments in a window grows rapidly with
resolution. We can control this complexity by using
the information from a lower resolution match to
guide a higher resolution match.

A match at a certain resolution gives disparities of
the points on the segments that match. These
disparities are used to compute disparities for
in-between points by linear interpolation. These
interpolated values provide the initial disparity for the
match at the next higher level. Since the approximate
disparity is known, the search for a match
is now restricted to a small range, thus limiting the
computational complexity and also reducing chances
of an error. Note that the errors made at an earlier
stage will not be corrected at higher resolution by
this method, though they do not necessarily
propagate (the higher resolution segments may end
up with no matches).
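
The coarse-to-fine step described above can be sketched as
follows. This is a minimal illustration in Python, not the
implementation used for the results reported here: the function
names, the factor of two between resolution levels, and the
`slack` margin are all assumptions made for the example.

```python
def interpolate_row(known, width):
    """Fill in a disparity for every column of one image row.

    known: dict mapping column -> disparity for matched points.
    Columns between matched points are linearly interpolated;
    columns outside the matched span copy the nearest matched value.
    """
    if not known:
        return [None] * width  # no coarse match, so no prediction
    cols = sorted(known)
    out = [0.0] * width
    for x in range(width):
        if x <= cols[0]:
            out[x] = float(known[cols[0]])
        elif x >= cols[-1]:
            out[x] = float(known[cols[-1]])
        else:
            # find the pair of matched columns that brackets x
            for a, b in zip(cols, cols[1:]):
                if a <= x <= b:
                    t = (x - a) / (b - a)
                    out[x] = (1 - t) * known[a] + t * known[b]
                    break
    return out


def search_range(coarse_row, x_fine, scale=2, slack=1):
    """Disparity interval to search at the next finer level.

    Assumes (hypothetically) that coordinates and disparities both
    grow by `scale` from one level to the next.
    """
    predicted = coarse_row[x_fine // scale] * scale
    return (predicted - slack, predicted + slack)
```

For example, with matched disparities 2 and 6 at columns 0 and 4
of a coarse row, the interpolated row is [2, 3, 4, 5, 6, 6], and
column 5 at the finer level is searched only in the disparity
interval (7, 9).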


Figure 3 shows the segments extracted and matched
from Figure 2, reduced to a resolution of 128 x 128.
The number of matched segments is 359. Figure 4
shows the segments matched at the resolution of
256 x 256; the number of matched segments is now
1340. We believe that the matching results for this
image are impressive and almost all segments are
matched correctly. However, we still lack a suitable
way of displaying the results in 3-D, as linear
interpolation is sensitive to local errors which mask the
other data in a perspective display. As a partial
solution, Figure 5 displays the points with disparity
values that would put their height a certain minimum
amount above ground (about half the height of the
building itself). It is clear that major parts of the
Pentagon building are detected correctly.
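
The pairwise order constraint of item 2 above can be illustrated
with a short sketch. It is not the evaluation function of [4],
which is considerably more complex. The sketch assumes rectified
images (epipolar lines horizontal), represents a segment as a pair
of (x, y) endpoints, and uses a hypothetical constant penalty; all
names are invented for the example.

```python
def x_at(seg, y):
    """x coordinate of a segment at row y (segment = ((x0, y0), (x1, y1)))."""
    (x0, y0), (x1, y1) = seg
    if y1 == y0:
        return min(x0, x1)  # segment lies along the epipolar line
    return x0 + (x1 - x0) * (y - y0) / (y1 - y0)


def pairwise_order(a, b):
    """-1 if a is left of b, +1 if right, 0 if they share no rows.

    The order is read at the middle of the rows the two segments share.
    Because segments do not intersect, any shared row gives the same sign.
    """
    lo = max(min(a[0][1], a[1][1]), min(b[0][1], b[1][1]))
    hi = min(max(a[0][1], a[1][1]), max(b[0][1], b[1][1]))
    if lo > hi:
        return 0
    y = (lo + hi) / 2
    d = x_at(a, y) - x_at(b, y)
    return (d > 0) - (d < 0)


def order_penalty(i, j, k, l, penalty=10.0):
    """Penalty that match (k, l) contributes to match (i, j).

    i and k are segments in the left image, j and l in the right image.
    The penalty value is a placeholder, not a value from the paper.
    """
    left = pairwise_order(i, k)
    right = pairwise_order(j, l)
    if left != 0 and right != 0 and left != right:
        return penalty
    return 0.0
```

A neighboring match whose left-to-right order is reversed between
the two images adds the penalty; pairs that share no epipolar line
have no defined order and add nothing.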

Some  Remaining  Problems 

- Our most important problem is the detection of
segments that are matched incorrectly. The number of
such mismatches is small, but some of them can
have a significant effect on the interpretation of the
image. We are currently investigating use of more
global context to identify and correct such matches.

- We need to develop a suitable way of interpolating
and displaying surfaces. Many sophisticated
methods of surface interpolation exist in the
literature (e.g. Terzopoulos [6]); our main difficulty is in
identifying errors and discontinuities.

4.2.  Developing  "Domain  Experts"  (Huertas,  Nevatia  and  Price)

As a step towards developing visual domain experts,
we have selected the task of mapping a major commercial
airport complex. Our first chosen task is to find the
runways and taxiways. One may think that these objects
would be easy to extract and would consist of relatively
homogeneous, elongated linear strips. In reality, these
objects are rather complex; an example of the Los Angeles
International Airport (LAX) is shown in Figure 6. The
intensity of a runway surface is not uniform; in fact, in one
of the runways the surface material itself changes from
concrete to blacktop, perhaps due to extensions made at
different periods. There are many markings on the
runways and taxiways - some are intended to aid pilots,
others are due to factors such as tire tread marks, dirt,
etc. The runways and taxiways have many intersections
and the contrast with the shoulders is not always strong.

The difficulty of mapping in the face of the above
complexities is, of course, dependent on the amount of a priori
knowledge that is assumed. If we assume that we are
looking at LAX and have a previous map, the task of
locating runways is much easier. We are taking a mid-way
approach that assumes that we are looking at a major
commercial airport and know the altitude of the camera, but
not the identity of the scene. Thus, we can infer a range
for widths and lengths of the runways and taxiways but do
not know their specific relative locations and orientations.

Our approach is to first extract linear line segments
in the image using our "LINEAR" software (which
incorporates both the Nevatia-Babu [7] and the Marr-Hildreth
[8] edge detectors). Figure 7 shows the segments
extracted from Figure 6 (the original image is 3000 x 800).
Then we group the line segments into "anti-parallel" pairs
or APARs (parallel lines of opposite contrast or
orientation). If the runways followed the simple model of
being homogeneous, linear strips, we could expect the
APAR grouping (assuming a known range of widths) to
isolate them. With the complications described above, we
can expect to get APARs not only between the two edges
of the runways, but between the edges, the markings, the
tread marks, etc., and also between these features and
surrounding features such as edges of the shoulder area (the
full set of APARs is not shown, as the picture is too
cluttered to convey information easily).
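
The APAR grouping can be sketched as below. This is only an
illustration under stated assumptions, not the LINEAR software:
each segment is taken to be a tuple (x0, y0, x1, y1) of nonzero
length, directed so that the brighter side is always on the same
hand, which makes the two edges of a strip point in opposite
directions. The angular tolerance and all function names are
hypothetical.

```python
import math


def direction(seg):
    """Direction of a directed segment, in radians."""
    x0, y0, x1, y1 = seg
    return math.atan2(y1 - y0, x1 - x0)


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def perpendicular_distance(seg, point):
    """Distance from a point to the infinite line through seg."""
    x0, y0, x1, y1 = seg
    px, py = point
    length = math.hypot(x1 - x0, y1 - y0)
    return abs((x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)) / length


def overlap_length(a, b):
    """Length of b's projection onto a that falls within a's extent."""
    x0, y0, x1, y1 = a
    length = math.hypot(x1 - x0, y1 - y0)
    ux, uy = (x1 - x0) / length, (y1 - y0) / length
    t = sorted(((b[0] - x0) * ux + (b[1] - y0) * uy,
                (b[2] - x0) * ux + (b[3] - y0) * uy))
    return max(0.0, min(length, t[1]) - max(0.0, t[0]))


def find_apars(segments, min_width, max_width, tol=math.radians(5)):
    """Return (i, j, width) for every anti-parallel pair in the width range."""
    apars = []
    for i, a in enumerate(segments):
        for j in range(i + 1, len(segments)):
            b = segments[j]
            # anti-parallel: directions differ by about 180 degrees
            if abs(angle_diff(direction(a), direction(b)) - math.pi) > tol:
                continue
            mid_b = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
            width = perpendicular_distance(a, mid_b)
            if min_width <= width <= max_width and overlap_length(a, b) > 0:
                apars.append((i, j, width))
    return apars
```

The exhaustive pairing is quadratic in the number of segments; for
an image of the size quoted above some spatial indexing would
presumably be needed, which the sketch leaves out.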

Now, our task is to separate the desired APARs from
the host of the others. Also, there is no single APAR that
spans the entire length of the runway and we must join
the appropriate fragments. At this stage, we need to use
our knowledge of airports. Our implementation is in
very preliminary stages; we will describe our approach
first and then the specific implementation.

Broadly speaking, we wish to follow a "generate and
test" paradigm. Then, the work to be done is in rules and
methods for hypothesizing and verifying for this particular
problem. One way of hypothesizing is to search for an
area of least ambiguity and to propagate the inferences.
For example, if in a certain part of the runway (usually at
one end that gets little traffic) we find only a single APAR
that has the appropriate width, we can try to extend it.
Another method is to combine evidence from various
pieces and prefer those that have broad support. For
example, if a set of APARs are colinear, we can hypothesize
a longer APAR and try to verify it. Verification might be
by checking if we can explain the departure of the data
from the ideal by the expected causes. So far, we have
implemented only simple hypothesis and verification
methods, which are described below.

The first step is to find those APARs that may be
part of a runway or taxiway. The direction of the runways
may be known a priori, if we are analyzing a known
airport. In our case, we constructed a length-weighted
histogram of segment directions and used the peak to get an
approximate direction for the runway (we are assuming
that the scene around runways is dominated by lines
along and on the runway). We also estimate minimum and
maximum widths for runways and taxiways. Again, these
should be easily available for a known airport, or even for
a known type of airport. We derived these widths from a
length-weighted histogram of APAR widths. The set of
potential APARs that might be part of runways is then
constructed by choosing those that are in the appropriate
direction and have widths between the required bounds.
Further, they are required to have an aspect ratio
exceeding 5 to 1. Figure 8 shows such a set constructed
from Figure 7 (the display shows each APAR as a
rectangular box).
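
A length-weighted direction histogram and the candidate filter
described above might look like the following sketch. The bin
size, the direction tolerance, and the function names are
assumptions for illustration; only the 5 to 1 aspect ratio comes
from the text.

```python
import math


def dominant_direction(segments, bin_deg=5):
    """Peak of a length-weighted histogram of segment directions.

    segments: iterable of (x0, y0, x1, y1).
    Directions are folded into [0, 180) since a line has no sense.
    Returns (center of the peak bin in degrees, list of bin weights).
    """
    bins = [0.0] * (180 // bin_deg)
    for x0, y0, x1, y1 in segments:
        length = math.hypot(x1 - x0, y1 - y0)
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0
        # the final modulo guards against an angle that rounds to 180.0
        bins[int(angle // bin_deg) % len(bins)] += length
    peak = max(range(len(bins)), key=bins.__getitem__)
    return (peak + 0.5) * bin_deg, bins


def is_runway_candidate(length, width, angle_deg, runway_deg,
                        min_width, max_width,
                        tol_deg=10.0, min_aspect=5.0):
    """True if an APAR has the right direction, width and elongation."""
    diff = abs(angle_deg - runway_deg) % 180.0
    diff = min(diff, 180.0 - diff)
    return (diff <= tol_deg
            and min_width <= width <= max_width
            and length > min_aspect * width)
```

Weighting by length keeps the many short segments from markings
and clutter from outvoting the few long runway edges.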

The next stage consists of several steps. Those
APARs that share a common segment, and are colinear
and of the same "color" (brighter or darker than the
surround), are hypothesized to form a new longer APAR
(the assumption
is that we are likely to miss one side of a runway due to
intersections or other reasons). A test on the acceptability
of the merged APAR is also performed (described below)
and APARs that are completely contained in another are
eliminated. Figure 9 shows the results of the processing
after this stage.

The runways are still in many disconnected APARs.
The next stage is to hypothesize that colinear APARs are
connected. Before a connection is established, we examine
the gap between two APARs to be connected, looking for
evidence contrary to their being connected. Figures 10 (a)
and (b) show a connection hypothesis (the area to be
bridged is shown in bold dashes) and the segments in this
area. At this time, the verification consists simply of
examining whether there are many segments normal to the
direction of the APAR, as they would represent a potential
obstacle (the exact measure used is a ratio of the sum of
the lengths of the segments in the near normal directions
to the sum of the lengths of the segments in the APAR
direction). The verification process needs to be made
much more informed and not dependent on a single number
such as this. Figure 11 shows the four longest resulting
APARs, which include the desired runways (the top and the
bottom). We have also obtained similar results for
extracting taxiways (not shown).
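
The gap verification measure can be written down directly from
the description above. In this sketch the angular tolerance that
defines "near normal" and "in the APAR direction", and the
acceptance threshold, are assumptions; the text does not give the
values that were used.

```python
import math


def gap_obstacle_ratio(segments, apar_deg, tol_deg=20.0):
    """Ratio of near-normal segment length to along-APAR segment length.

    segments: (x0, y0, x1, y1) tuples found in the gap to be bridged.
    apar_deg: direction of the APAR in degrees.
    """
    along = across = 0.0
    for x0, y0, x1, y1 in segments:
        length = math.hypot(x1 - x0, y1 - y0)
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
        diff = abs(angle - apar_deg) % 180.0
        diff = min(diff, 180.0 - diff)  # 0 = parallel, 90 = normal
        if diff <= tol_deg:
            along += length
        elif diff >= 90.0 - tol_deg:
            across += length
    if along == 0.0:
        return math.inf if across > 0.0 else 0.0
    return across / along


def accept_connection(segments, apar_deg, max_ratio=0.5):
    """Accept the bridge when little evidence runs across the gap."""
    return gap_obstacle_ratio(segments, apar_deg) <= max_ratio
```

A gap containing 100 pixels of edges along the runway direction
and 10 pixels across it gives a ratio of 0.1 and is bridged under
the assumed threshold.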

This example is not intended to indicate that we
have developed a general system that is capable of
extracting runways and taxiways in complex images, but
simply to show the complexity of the problem and the
value of certain types of processing.

5.  PARALLEL  PROCESSING  OF  IU  ALGORITHMS
(Moldovan,  Dixit,  and  Tenorio)

In the past year, we have initiated a serious study of
the parallel processing of IU algorithms. Our approach has
been to take specific algorithms that are representative of
a broad class of IU algorithms and study their parallel
implementation. We have been concentrating on the "higher
level" algorithms. Our studies have been mostly in the
context of a parallel architecture, called SNAP, proposed at
USC by Prof. Moldovan and his students. SNAP consists
of a 2-D array of processors with each processor
connected to its four neighbors. There is also a global bus
for broadcasting to all processors simultaneously. We
have looked at the mapping of discrete relaxation and
general production systems on this machine. The results
are described in another paper in these proceedings [9].
The design of SNAP is not complete at this time, and we
can only get a preliminary estimate of the suitability of
this type of architecture for certain types of IU algorithms.
Complete complexity analysis of a complex vision
algorithm on a complex, parallel machine remains a difficult
task; our efforts represent very early attempts.
1  -  occluding  boundaries

2  -  surface  discontinuities 

3  -  surface  markings 


Figure  1:  Two  occluding  objects  and  a  classification  of 
the  boundaries 


Figure  3:  Segments  detected  and  matched  in  Figure  2  at
128x128  resolution
(a)  left  (b)  right


Figure  4:  Segments  detected  and  matched  in  Figure  2  at
256x256  resolution
(a)  left  (b)  right


Figure  5:  Points  that  are  at  least  a  certain  height  above 
ground 
Figure  7:  Segments  detected  from  Figure  6


Image  Understanding  Research  at  CMU 


Takeo  Kanade 

Computer  Science  Department 
Carnegie-Mellon  University
Pittsburgh,  PA  15213 


In  the  CMU  Image  Understanding  Program  we  have  been 
working  on  both  the  basic  issues  in  understanding  vision 
processes  which  deal  with  images  and  shapes,  and  the  system 
issues  in  developing  demonstrable  vision  systems.  This  report
reviews  our  progress  since  the  June  1983  workshop  proceedings. 
The  highlights  in  our  Program  include: 

•  Shafer has been working on techniques for dealing
with shadows, reflectance, color and highlights.

•  Ohta and Kanade have developed a stereo program
using dynamic programming techniques.

•  Thorpe and Matthies have demonstrated a vision-based
navigation system with obstacle avoidance on
the CMU Rover.

•  Lucas has applied the method of differences in
obtaining the camera motion from an image
sequence. Kanade has been developing a new
theory for motion from image sequences.

•  Kanade, Herman, Tomita, and Smith have been
developing techniques for generating and matching
scene descriptions from 3D range imagery and aerial
photos.

•  Webb and Dew have been working toward a vision
system on a CMU-designed systolic machine called
Warp.

•  McKeown has continued to work in the digital
mapping and photo interpretation area.

1.  Shapes  and  Images


1.1.  Shadow  Geometry  for  Solids  of  Revolution
We have been producing an implementation of Shafer's
theories [19] of shadow geometry and occluding contour
interpretation for solids of revolution. We now have a collection
of programs that produce and analyze occluding boundaries and
shadows of solids of revolution. Preliminary results indicate that,
for an image of a solid of revolution with shadow quantized to a
100 x 100 grid, the surface normals on a solid of revolution can be
estimated to an accuracy of 1% in the p and q coordinates in
gradient space, using the methods of Shafer's thesis. We also
have a negative result: apparently, there is not sufficient
information in the surface normals along the terminator to
uniquely compute the angle of view (angle between the line of
sight and the axis) in an image of a solid of revolution. However,
alternative methods can be used for finding this angle, such as
examining the elliptical images of the ends of the object. We have
assembled a set of images of real solids of revolution, and are
implementing a hand-segmentation program to produce contours
from these images that can be used as input to the existing
analysis programs.

1.2.  Reflectance  Maps  Under  Perspective

We have been examining the use of reflectance maps in images
taken under perspective projection. The effect of the [cos^4 α]
term, previously noted by Horn and Sjoberg [6], seems to be
smaller than the effect of the varying viewing direction upon the
photometric angles e and g. For non-Lambertian surfaces, the
specular peak in the reflectance map thus varies considerably in
size, shape, and position in gradient space. This suggests that,
for precise photometric image interpretation, the reflectance map
must be determined for each pixel individually. We are exploring
the magnitude of this effect with synthetic data, including the
possibility of algorithms based on compromises such as "lazy
evaluation" of the reflectance map or the use of a collection of
reflectance maps, each for some rectangular region of the image
and representing the "average" reflectance map for that region.

1.3.  Color  and  Highlights 

Color and highlights are important image features, yet very little
research has been done on them from the computer vision point
of view. We have produced a theory [20] that describes the
relationship between color and highlights in an image of a glossy
surface. Because specular and diffuse reflection are caused by
different physical processes (interface versus deep reflection),
they produce different colors in the image. The pixel values
measured by the camera will be a sum of these two components,
i.e. a linear combination of the RGB color of the specular
reflection and the color of the diffuse reflection at each point.
Since these colors are constant across a surface, the pixel values
measured across a single surface will form a parallelogram in
RGB color space. The position of any pixel's color within that
parallelogram yields the magnitude of the specular and diffuse
reflection at that point. This theory applies to rough surfaces of
inhomogeneous media, i.e. most plastics, paints, glazed
ceramics, glass, etc. The color theory will work under extended
light sources, perspective imaging projection, and curved
surfaces. We are working now to extend the theory to deal with
diffuse ("ambient") illumination, and we hope to calibrate a
camera to use in implementing the theory on real images.
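
The linear-combination model can be made concrete with a small
sketch. Assuming the specular and diffuse RGB colors of a surface
are already known (how they are estimated is not covered here),
the two magnitudes at a pixel follow from a least-squares fit.
The function names and the example colors are hypothetical.

```python
def dot(u, v):
    """Dot product of two RGB triples."""
    return sum(a * b for a, b in zip(u, v))


def decompose(pixel, specular_color, diffuse_color):
    """Solve pixel ~= ms * specular_color + md * diffuse_color.

    Uses the 2 x 2 normal equations of the least-squares problem.
    Returns (ms, md), the specular and diffuse magnitudes.
    """
    ss = dot(specular_color, specular_color)
    dd = dot(diffuse_color, diffuse_color)
    sd = dot(specular_color, diffuse_color)
    ps = dot(pixel, specular_color)
    pd = dot(pixel, diffuse_color)
    det = ss * dd - sd * sd
    if abs(det) < 1e-12:
        # the two colors are collinear, so the split is not unique
        raise ValueError("specular and diffuse colors are collinear")
    ms = (ps * dd - pd * sd) / det
    md = (pd * ss - ps * sd) / det
    return ms, md
```

With a white highlight color (1, 1, 1) and a reddish body color
(0.8, 0.2, 0.1), a pixel built as 0.3 times the first plus 0.5
times the second is recovered as (0.3, 0.5) up to rounding.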

1.4.  Stereo

We have implemented a stereo algorithm using dynamic
programming [18]. Edge-based stereo involves two searches.
One is intra-scanline search for obtaining correspondence of
edge-delimited intervals on each scanline pair after the pair of
images has been rectified so that epipolar lines are horizontal.
The intra-scanline search can be treated as the problem of finding
a matching path on a two-dimensional (2D) search plane. The
other is inter-scanline search for possible correspondence of
vertically connected edges across scanlines in right and left
images. This provides the consistency constraints that
inter-scanline matchings should satisfy. Previous use of dynamic
programming has been limited to the intra-scanline search [4, 1].
Henderson processed pairs of scanlines sequentially where
results of the previous line guided searches in the succeeding
scanline. Baker [1] employed post-processing on the results of
intra-scanline search to enforce the inter-scanline constraints.

We utilize dynamic programming for both the inter-scanline
and intra-scanline searches, and the two searches proceed
simultaneously: the first supplies the consistency constraint to the
second while the second supplies the matching score to the first.
An interval-based similarity metric is used to compute the score.
By simultaneously considering intra- and inter-scanline searches,
the correspondence problem in stereo can be cast as that of
finding in a three-dimensional search space an optimal matching
surface that most satisfies the intra-scanline matches and
inter-scanline consistency. Here, a matching surface is defined by
stacking 2D matching paths, where the 2D matching paths are
found in a 2D search plane whose axes are left-image column
position and right-image column position, and the stacking is
done in the direction of the row (scanline) number of the images.
The cost of the matching surface is defined as the sum of the
costs of the intra-scanline matches on the 2D search planes,
while vertically connected edges provide the consistency
constraint across the 2D search planes and thus penalize those
intra-scanline matches which are not consistent across the
scanlines.
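
To make the idea of a matching path on the 2D search plane
concrete, the sketch below runs dynamic programming on a single
scanline pair. It is a deliberate simplification and not the
algorithm of [18]: it matches edges by one scalar feature rather
than using the interval-based similarity metric, it charges a
hypothetical fixed cost for an unmatched edge, and it omits the
inter-scanline consistency search entirely.

```python
def match_scanline(left, right, occlusion_cost=10.0):
    """Order-preserving matching of edges on one scanline pair.

    left, right: edge feature values (e.g. contrast) in column order.
    Returns (total cost, list of (left index, right index) matches).
    """
    n, m = len(left), len(right)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # match left edge i-1 with right edge j-1
                c = cost[i - 1][j - 1] + abs(left[i - 1] - right[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i > 0:  # left edge i-1 has no partner
                c = cost[i - 1][j] + occlusion_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_left"
            if j > 0:  # right edge j-1 has no partner
                c = cost[i][j - 1] + occlusion_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_right"
    # trace the optimal path back from the far corner of the plane
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_left":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs
```

The path is monotone in both axes, which is what enforces the
left-to-right ordering of matched edges along the scanline.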

The  algorithm  has  been  tested  with  various  images  including 
urban  aerial  images,  synthesized  images,  and  block  scenes.  The 
results  show  that  the  3D  search  achieves  roughly  1/10  the  error 
rate  of  the  2D  search.  For  some  images  containing  a  large 
number  of  edges,  the  3D  search  requires  as  much  as  10  to  15 
times  longer  processing.  However,  the  processing  time  is 
expected  to  be  reduced  drastically  by  implementing  the  algorithm 
on  a  parallel  machine  such  as  a  systolic  array  processor. 

2.  Vision  for  Navigation 

The  IUS  group  has  been  working  on  navigational  vision  in 
collaboration  with  the  CMU-Robotics  Mobile  Robot  Lab.  The
effort  includes  path  planning,  motion  determination,  and  obstacle 
detection  using  video  and  sonar  data  [22,  23]. 

2.1.  A  Visual  Navigation  System

Using a CMU Rover testbed vehicle, a visual navigation system
has been demonstrated by Thorpe and Matthies [10] to maneuver
to a predefined location in a static environment. The visual
system is based on algorithms developed by Moravec for the
Stanford Cart. At each cart position, these algorithms used
stereo correspondence in nine camera images to triangulate the
distance to potential obstacles. Motion of the vehicle was
determined by tracking these obstacles over time. Detailed
evaluations have been made on the performance of the
navigational vision system during the evolution from the Stanford
Cart to the present CMU system. The results of the evaluation
have led to the use of fewer images per step, to the use of more
constraints to limit the search in the correspondence process,
and toward the use of a different motion-solving algorithm that
better exploits the rigid motion of the scene.

2.2.  Obtaining  Camera  Motions  from  Image  Sequences 

Lucas and Kanade [9] applied the method of differences to the
problem of obtaining camera motions from an image sequence.
The method of differences refers to a technique for image
matching that uses the intensity gradient of images to iteratively
improve the match between the two images. When the iterative
scheme is combined with image smoothing, the method exhibits
good accuracy and wide convergence range. This method will be
used in the actual demonstration system of visual navigation.
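
A one-dimensional sketch may help show how an intensity gradient
is used to improve a match iteratively. It estimates a single
translation between two sampled signals; the camera-motion problem
treated in [9] involves more parameters. The function names, the
linear interpolation, the central-difference gradient, and the
fixed iteration count are choices made for this example, and the
smoothing step mentioned above is left out.

```python
import math


def sample(signal, x):
    """Linear interpolation of `signal` at real position x (ends clamped)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]


def estimate_shift(f, g, iterations=20):
    """Estimate h such that g[x] is approximately f(x + h).

    Each pass linearizes f around the current estimate and takes the
    least-squares correction sum(grad * residual) / sum(grad ** 2).
    """
    h = 0.0
    for _ in range(iterations):
        num = den = 0.0
        for x in range(1, len(g) - 1):
            fx = sample(f, x + h)
            grad = (sample(f, x + h + 1) - sample(f, x + h - 1)) / 2.0
            num += grad * (g[x] - fx)
            den += grad * grad
        if den == 0.0:
            break  # flat signal: the gradient carries no information
        h += num / den
    return h
```

For a smooth bump shifted by a displacement that is small compared
with its width, the estimate settles near the true shift within a
few passes; larger displacements are where smoothing the signals
first would be expected to widen the convergence range.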

Kanade [8] is currently working on a new theory for motion from
image sequences that relates line correspondences, differential
motions of the viewer, and temporal and spatial differentials of
images. The theory consists of two parts. The first part provides
the formulas to solve the differential motion of the viewer from line
correspondences. The second part relates the line
correspondence problem with temporal and spatial image
differentials. Interestingly, this theory seems to provide a way to
cleverly get around the aperture problem that arises as an
ambiguity in matching points using only the local image
properties within a window. When viewed as a line
correspondence problem, local image properties do provide
enough information.

3.  Vision  on  a  Systolic  Machine  Warp 

We  have  started  developing  a  vision  system  on  a  systolic  array 
machine.  The  machine,  called  Warp,  is  being  designed  and  built
by  H.  T.  Kung's  VLSI  group  at  CMU.  The  original  design  of  Warp
was  for  a  parallel  architecture  providing  high  speed  floating  point 
computation  for  signal  processing.  This  design  has  been 
changed  in  cooperation  with  the  IUS  group  to  provide  the 
computational  flexibility  needed  for  doing  a  variety  of  low-level 
vision  tasks. 

Warp consists of ten cells organized in a linear array. Each cell
contains a 32-bit floating-point adder and multiplier, 4K of 152-bit
word micro store, and 4K of 32-bit RAM, as well as other registers
to provide sufficient control. The adder and multiplier, which are
pipelined, each produce one floating-point result every 200
nanoseconds for a throughput of 10 MFLOPS per cell or 100
MFLOPS for the whole array. The cells include facilities for
systolic processing or limited local address generation, so that
each cell can be programmed individually to do a different
computation, or the whole array can do a coordinated systolic
computation. The 10-cell machine will be interfaced with a VAX
11/780 through an ARTEC DPS 2400 interface machine, which
provides 1 Mbyte of memory and 24 Mbyte/sec bandwidth.
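
The throughput figures quoted above follow from simple arithmetic,
reproduced here as a check. The function name and parameter names
are only for illustration.

```python
def warp_throughput_mflops(cycle_ns=200, units_per_cell=2, cells=10):
    """Peak throughput implied by one result per unit per cycle.

    units_per_cell counts the pipelined adder and multiplier.
    Returns (MFLOPS per cell, MFLOPS for the whole array).
    """
    results_per_second = 1e9 / cycle_ns  # 5 million results per unit
    per_cell = units_per_cell * results_per_second / 1e6
    return per_cell, cells * per_cell
```

One result every 200 nanoseconds is 5 million results per second
per unit; two units per cell give 10 MFLOPS, and ten cells give
100 MFLOPS. These are peak rates that assume both pipelines are
kept full.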


At present, Webb is working to make Warp a useful tool for
low-level vision [8]. This will involve programming a large number
of low-level vision subroutines, and integrating Warp into a
full-scale visual task. The low-level vision subroutine library will
provide a ready source of programs for someone who wants to
use Warp, as well as providing numerous guides in programming
Warp for a new programmer. A library of vision subroutines has
been studied and classified from the point of view of implementing
them on Warp. The assembler and simulator for Warp have been
written. Several sample Warp programs including 3x3
convolutions have been written.

Dew [2] has investigated reimplementation of FIDO algorithms
(visual obstacle avoidance algorithms by Moravec and Thorpe) on
Warp; we anticipate a speed-up of a factor of 10 from the Warp
implementation compared with the VAX implementation. We also
plan to investigate use of Warp for other navigation tasks,
including path following.

4.  3D  Vision  System:  Generation  and 
Matching  of  3D  Scene  Descriptions 

4.1.  From  3D  Range  Imagery 

We  have  been  active  in  developing  techniques  for  generating 
three-dimensional  scene  descriptions  from  range  imagery  and  for
matching  scene  descriptions.  These  techniques  will  be  applied  to 
both  industrial  vision  problems  and  outdoor  scene  analysis  to  be 
used  for  outdoor  navigation. 

Two pieces of work have been done with different approaches
and emphasis. The first piece of work, by Smith [21], produces
object-centered three-dimensional descriptions starting from
point-wise 3D range data obtained by a light-stripe rangefinder. A
careful geometrical analysis shows that contours which appear in
light-stripe range images can be classified into eight types, each
with different characteristics in occluding vs. occluded and
different camera/illuminator relationships. Starting with
detecting these contours in the iconic range image, the
descriptions are generated moving up a hierarchy from contour,
surface, object, to scene. We use conical and cylindrical
surfaces as primitives. The emphasis in this work is data-driven
bottom-up autonomous processing, generating object
descriptions from complicated scenes without referring to
specific pre-stored object models. Therefore, in the process of
generating descriptions, we exploit the fact that general coherent
relationships, such as symmetry, collinearity, and having a
common axis, which are present among lower-level elements in
the hierarchy allow us to hypothesize upper-level elements. The
analysis program has been applied to complex scenes containing
cups, pans, and toy shovels.

The second piece of work, by Tomita [24], is edge-based and
directed  toward  object  recognition.  A  light-stripe  rangefinder 
image  is  first  segmented  into  edges  and  surfaces.  This 
segmentation  is  done  in  3D  space;  edges  are  classified  as  either 
3D  straight  lines  or  circular  curves,  and  surfaces  are  either  planar 
or  conic.  An  object  model  consists  of  component  edges  and 
surfaces  and  their  interrelationships.  Our  model  representation 
can  accommodate  not  only  objects  with  rigid,  fixed  shape,  but 
also  objects  with  articulations  between  their  parts,  such  as 
rotational-joint or linear slide motions. The matching process is
rather  straightforward.  A  transformation  from  an  object  model  to 
the  scene  is  hypothesized  by  initially  matching  a  few  scene 
features  with  model  features.  The  transformation  is  then  tested 
with  the  rest  of  the  features  for  verification.  The  object  model  is 
constructed  either  interactively  by  using  sample  scenes  or  by 
deriving  the  model  representation  from  the  PADL-2  solid- 
modeling  system. 
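The hypothesize-and-verify matching described above can be sketched as follows. This is a minimal illustration, not Tomita's implementation: 3D points stand in for the edge and surface features, rigid (non-articulated) models are assumed, and all function names and the tolerance `tol` are hypothetical.

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def unit(a):
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]


def frame(p0, p1, p2):
    """Orthonormal axes built from three non-collinear points."""
    e1 = unit(sub(p1, p0))
    e3 = unit(cross(e1, sub(p2, p0)))
    e2 = cross(e3, e1)
    return [e1, e2, e3]


def apply(R, t, p):
    """Apply the rigid transformation q = R p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def hypothesize(model3, scene3):
    """Rigid transform (R, t) taking three model points onto three scene points."""
    fm, fs = frame(*model3), frame(*scene3)
    # R maps each model axis onto the corresponding scene axis.
    R = [[sum(fs[k][i] * fm[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = sub(scene3[0], apply(R, [0.0, 0.0, 0.0], model3[0]))
    return R, t


def verify(R, t, model_pts, scene_pts, tol=0.05):
    """Count model points that land within `tol` of some scene point."""
    return sum(1 for m in model_pts
               if any(math.dist(apply(R, t, m), s) <= tol for s in scene_pts))


# Scene = model rotated 90 degrees about z and translated by (2, 3, 4).
model = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
scene = [[2, 3, 4], [2, 4, 4], [1, 3, 4], [2, 3, 5], [1, 4, 4]]
R, t = hypothesize(model[:3], scene[:3])
support = verify(R, t, model, scene)
```

The first three correspondences fix the transformation; the remaining features either support the hypothesis or fail to, which is the verification step.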

During the course of our work we are accumulating facilities
useful for acquiring, processing, and displaying three-dimensional
range images. The collection includes a data
acquisition system for industrial settings; a program library for
boundary detection, 3D curve segmentation, 3D edge detection,
etc.; and a 3D display program that uses fast 3D graphics to quickly
show the data with the model description overlaid.

4.2. From Aerial Photos: 3D Mosaic System

The  3D  Mosaic  system  now  can  combine  both  stereo  image 
analysis  and  monocular  image  analysis  as  sources  of  information 
to  be  accumulated  into  a  consistent  3D  description  of  a  scene. 
Each  view  of  the  scene,  which  may  be  either  a  single  image  or  a 
stereo  pair,  undergoes  analysis  which  results  in  a  3D  wireframe 
description  that  represents  portions  of  edges  and  vertices  of 
objects. The surface-based scene description is constructed
from the wireframes. With each successive view, the description
is incrementally updated and gradually becomes more accurate
and complete. Task-specific knowledge, involving block-shaped
objects in an urban scene, is used to extract the wireframes and
construct and update the model. A complete report is included in
these proceedings [5].


5.  Digital  Mapping  and  Photo 
Interpretation 

Work  in  the  area  of  Digital  Mapping  and  Photo 
Interpretation  [13]  has  continued  and  expanded  into  two  new 
areas.  The  first  is  map  guided  feature  extraction,  and  the  second 
area  involves  new  research  in  the  area  of  building  rule  based 
systems  for  photo  interpretation. 

5.1.  Map  Guided  Feature  Extraction 

We  have  developed  an  approach  to  map  guided  image  analysis 
using a region-based segmentation system. A full paper is
included in these proceedings [14]. This segmentation system
has been used to search a database of images that are in
correspondence with a geodetic map to find occurrences of
known buildings, roads, and natural features. The map predicts
the approximate appearance and position of a feature in an
image. The map also predicts the area of uncertainty caused by
errors in the image-to-map correspondence. The segmentation
process then searches for image regions that satisfy two-dimensional
shape and intensity criteria. If no initial region is
found,  the  process  attempts  to  merge  together  those  regions  that 
may  satisfy  these  criteria.  Several  detailed  examples  of  the 
segmentation  process  are  given. 
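The search step can be pictured with a small sketch. It assumes, purely for illustration, that each segmented region is summarized by a centroid, an area, and a mean intensity, and that the map prediction is expressed in the same terms together with an uncertainty radius. The field names and tolerances are hypothetical, and the region-merging fallback is not shown.

```python
import math


def candidate_regions(regions, predicted, uncertainty_radius,
                      area_tol=0.3, intensity_tol=20):
    """Return regions consistent with a map-derived prediction.

    A region qualifies if its centroid lies inside the uncertainty area
    around the predicted position and its area and mean intensity are
    close to the predicted values.
    """
    hits = []
    for region in regions:
        near = math.dist(region["centroid"], predicted["centroid"]) <= uncertainty_radius
        area_ok = abs(region["area"] - predicted["area"]) <= area_tol * predicted["area"]
        tone_ok = abs(region["mean_intensity"] - predicted["mean_intensity"]) <= intensity_tol
        if near and area_ok and tone_ok:
            hits.append(region)
    return hits


predicted = {"centroid": (100, 100), "area": 400, "mean_intensity": 180}
regions = [
    {"name": "A", "centroid": (105, 98), "area": 380, "mean_intensity": 170},
    {"name": "B", "centroid": (160, 100), "area": 410, "mean_intensity": 182},  # too far
    {"name": "C", "centroid": (102, 101), "area": 90, "mean_intensity": 181},   # too small
]
hits = candidate_regions(regions, predicted, uncertainty_radius=15)
```

When `hits` comes back empty, the process described above would go on to try merged combinations of neighbouring regions against the same criteria.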

This work uses the CONCEPTMAP database [11] from the MAPS
system [12, 15] as its source of map knowledge. In the
CONCEPTMAP database, map knowledge is represented as three-dimensional
descriptions of man-made features, natural features,
and conceptual features. Examples of man-made features are
buildings, roads, and bridges; natural features are rivers, lakes,
and forests; and conceptual features are political boundaries,
residential neighborhoods, and business areas. These feature
positions are represented in the map database in terms of
(latitude, longitude, elevation).

5.2.  Rule  Based  Interpretation  of  Aerial  Imagery 

We have begun research on the development of a rule based
system, SPAM [17, 16], that uses map and domain specific
knowledge to interpret airport scenes. This research investigates
the use of a rule based system for the control of image
processing and interpretation of results with respect to a world
model, as well as the representation of the world model within an
image/map database.


SPAM, a System for Photo interpretation of Airports using
MAPS,  is  an  image  interpretation  system.  It  coordinates  and 
controls  image  segmentation,  segmentation  analysis,  and  the 
construction  of  a  scene  model.  It  provides  several  unique 
capabilities  to  bring  map  knowledge  and  collateral  information  to 
bear  during  all  phases  of  the  interpretation.  These  capabilities 
include: 

•  The  use  of  domain  dependent  spatial  constraints  to 
restrict  and  refine  hypothesis  formation  during 
analysis. 

•  The  use  of  explicit  camera  models  that  allow  for  the 
projection  of  cartographic  information  onto  the 
image. 

•  The  use  of  image  independent  metric  models  for 
shape, size, distance, absolute and relative position
computation. 

•  The  use  of  multiple  image  cues  to  verify  ambiguous 
segmentations.  We  look  for  instances  of  features  in 
previous  or  contemporary  images  to  detect  missing 
features  in  the  model. 
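To illustrate what an explicit camera model buys, here is a minimal pinhole-projection sketch. It is not the SPAM or MAPS camera model: it assumes map positions have already been converted from (latitude, longitude, elevation) to local east/north/up metres, and the function name, the rotation matrix convention, and the numeric values are all hypothetical.

```python
def project(point_enu, cam_pos, R, focal_px, principal=(0.0, 0.0)):
    """Project a 3D map point into image coordinates with a pinhole camera.

    R is the world-to-camera rotation (rows are the camera axes expressed
    in world coordinates). Returns None for points behind the camera.
    """
    d = [point_enu[i] - cam_pos[i] for i in range(3)]
    xc, yc, zc = (sum(R[i][j] * d[j] for j in range(3)) for i in range(3))
    if zc <= 0:
        return None
    return (principal[0] + focal_px * xc / zc,
            principal[1] + focal_px * yc / zc)


# A nadir-looking camera 1000 m above the local origin:
# image x points east, image y points south, optical axis points down.
R_nadir = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
pixel = project((100.0, 50.0, 0.0), (0.0, 0.0, 1000.0), R_nadir,
                focal_px=2000.0, principal=(512.0, 512.0))
```

With such a model, any cartographic feature stored in the database can be drawn into the image frame, and image measurements can be compared against metric models of shape and size.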

The task of airport image analysis has several interesting
properties. First, airports are a complex organization of man-made
structures placed over a large ground area. While the
actual spatial arrangement of typical structures such as runways,
terminal buildings, parking lots, etc. varies greatly between
airports, the types of structures normally found in an airport
scene are well understood. The airport task provides a
"knowledge rich" environment, where functional relationships
between structures preclude arbitrary spatial arrangements and
provide spatial constraints. One can view the interpretation
problem from the generalist standpoint, and test how rule-based
interpretation can capture models of "generic" airports as well as
use knowledge about the layout of specific airports in a site-specific
manner.

Second, a body of literature [3, 7] on airport planning is readily
accessible and provides general design constraints. Knowledge
acquisition for spatial constraints therefore does not involve
examination of large numbers of sample airports.

Currently the system can extract and identify some runways,
taxiways, grassy areas, and buildings from region based
segmentations for several images of National Airport in
Washington, D.C. It builds multiple plausible functional area
descriptions, which support the runway and access road portions
of the scene model. It cannot currently recognize a complete
airport scene that satisfies its internal model. Many problems
remain, some in reliable low level feature extraction from the
imagery, others in the design and implementation of effective
recognition strategies using the rule based approach. However,
we  believe  that  the  integration  of  map  knowledge,  image 
processing  tools,  and  rule  based  control  and  recognition 
strategies  will  be  shown  to  be  a  powerful  computational 
organization  for  automated  feature  extraction  from  aerial  imagery. 
ABSTRACT

This report summarizes research carried out
during the period December 1982-August 1984 on
Contract DAAK70-83-K-0018. The focus of this research
is on image understanding techniques applicable
to autonomous land vehicle navigation. Particular
emphasis has been placed on time-varying
imagery analysis, but some work has also been done
on two-dimensional shape analysis and three-dimensional
object recognition. A second major emphasis
was the development of an approach to road
network following and obstacle avoidance tasks;
this work has now received major additional funding
under the DARPA Autonomous Vehicle Program.


1.  INTRODUCTION 

The  Computer  Vision  Laboratory  of  the  Center 
for  Automation  Research  at  the  University  of 
Maryland  first  received  support  under  the  DARPA 
Image  Understanding  Program  in  1976.  This  support, 
which  was  funded  through  the  U.S.  Army  Night  Vision 
and  Electro-Optics  Laboratory  in  Fort  Belvoir,  VA, 
emphasized the development of techniques for detecting
tactical targets in infrared imagery.

In December 1982 a new effort, entitled "Autonomous
Vehicle Navigation", was initiated at
Maryland under the Image Understanding Program.
This work was funded through USANVEOL under Contract
DAAK70-83-K-0018, with Dr. George R. Jones
as COTR and the Westinghouse Corp. as a subcontractor.

The focus of the research being conducted
under this project is the development of image
understanding techniques applicable to autonomous
land vehicle navigation. Particular emphasis has
been placed on time-varying imagery analysis, but
some work has also been done on two-dimensional
shape analysis and three-dimensional object recognition.
A second major emphasis was the development
of an approach to road network following and
obstacle avoidance tasks; this work has now received
major additional funding under the DARPA
Autonomous Vehicle Program.


2.  TIME-VARYING  IMAGERY  ANALYSIS 

Earlier  work  on  analysis  of  optical  flow 
fields  at  Maryland  dealt  with  propagation  of  local 
object  motion  estimates  along  contours  and  with 
flow  field  smoothing.  An  especially  powerful 
smoothing  technique  [1]  utilizes  global  information 
about  the  motion  field,  derived  from  the  histograms 
of  the  components  of  the  estimated  motion.  The 
method  is  an  adaptation  of  the  "superspike"  image 
enhancement  algorithm  to  motion  field  estimation. 
Experiments  indicate  that  the  method  can  yield  more 
accurate and precise estimates of motion than previously
proposed motion estimation algorithms.

A major new effort on optical flow field analysis
was initiated when Dr. Allen M. Waxman joined
the University of Maryland in May 1983. Dr.
Waxman's initial work on the structure from motion
problem was done in collaboration with Shimon Ullman
of MIT [2]. This work involved a new formulation
and method of solution of the image flow problem.

The  two-dimensional  image  flow  is  generated  by  the 
relative  rigid  body  motion  of  a  smooth,  textured 
object along the line of sight to a monocular camera.
By  analyzing  this  evolving  image  sequence,  one  hopes 
to  extract  the  instantaneous  motion  (described  by 
six  degrees  of  freedom)  and  local  structure  (slopes  and 
curvatures)  of  the  object  along  the  line  of  sight. 

The  formulation  relates  a  new  local  representation  of 
an  image  flow  to  object  motion  and  structure  by  twelve 
nonlinear, algebraic equations. The representation
parameters, termed observables, are given by the two
components of image velocity, three components of
rate-of-strain, spin, and six independent image gradients
of rate-of-strain and spin, evaluated at the
point on the line of sight. These kinematic variables
are motivated by the deformation of a finite
element of flowing continuum. A method for solving
these equations was devised and successfully implemented
on a VAX computer. A number of examples
were explored, revealing two classes of ambiguous
scenes (i.e., nonunique solutions are obtained).

A sensitivity analysis was also begun in order to
estimate noise levels in the representation parameters
which still yield acceptable solutions; indications
are that the method is quite stable. Finally,
an approach is suggested by which the kinematic
variables may be extracted from evolving contours
in an image sequence.
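The first six observables (image velocity, rate-of-strain, and spin) can be illustrated numerically. The sketch below assumes a flow field given as a function of image position and uses the standard continuum-mechanics definitions: rate-of-strain is the symmetric part of the velocity gradient, and spin is taken as half the curl. The report's exact sign and scaling conventions may differ, and the function name and step size `h` are hypothetical. The six gradient observables are not computed here.

```python
def observables(flow, x, y, h=1e-3):
    """Velocity, rate-of-strain, and spin of a 2-D flow at (x, y).

    `flow(x, y)` returns the image velocity (u, v). Derivatives are
    estimated by central differences with step `h`.
    """
    u0, v0 = flow(x, y)
    ux = (flow(x + h, y)[0] - flow(x - h, y)[0]) / (2 * h)
    uy = (flow(x, y + h)[0] - flow(x, y - h)[0]) / (2 * h)
    vx = (flow(x + h, y)[1] - flow(x - h, y)[1]) / (2 * h)
    vy = (flow(x, y + h)[1] - flow(x, y - h)[1]) / (2 * h)
    return {
        "velocity": (u0, v0),
        # Symmetric part of the velocity gradient: (e_xx, e_yy, e_xy).
        "strain": (ux, vy, 0.5 * (uy + vx)),
        # Antisymmetric part: local rate of rotation of the neighborhood.
        "spin": 0.5 * (vx - uy),
    }


# A neighborhood that dilates at rate k while rotating at rate w.
k, w = 0.2, 0.5
obs = observables(lambda x, y: (k * x - w * y, w * x + k * y), 0.3, -0.1)
```

For this test flow the strain components come out as (k, k, 0) and the spin as w, which separates the dilation from the rotation of the image neighborhood.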


Dr. Waxman has formulated a general "image
flow paradigm" [3] describing the relationship
between a three-dimensional (3-D) scene consisting
of several objects in rigid body motion, and its
associated two-dimensional (2-D) time-varying imagery.
The paradigm addresses a number of theoretical
issues: What is a useful representation for
the 2-D flow field and how can it be obtained from
the time-varying imagery? How should a flow field
be segmented and how do these segmentation boundaries
relate to the 3-D scene itself? How are the
3-D structure and motion of objects in the scene
recovered from the 2-D flow representation and its
segmentation? In attempting to answer these questions,
a variety of interesting concepts arise such
as image neighborhood deformation, evolving contours,
space-time stream tubes, virtual contours
and virtual tubes, flow analyticity, boundaries
of analyticity, kinematic analysis and structure-motion
compatibility. The various "elements" of
the paradigm are supported by analytic techniques,
some of which have already been developed, some of
which are now under investigation, and others which
remain to be studied.

Initial work on implementation of the paradigm
has been carried out as part of the Ph.D.
dissertation of Mr. Kwangyoen Wohn [4]. In the
kinematic analysis of time-varying imagery, where
the goal is to recover object surface structure
and space motion from image flow, an appropriate
representation for the flow field consists of a
set of deformation parameters which describe the
rate-of-change of an image neighborhood. We have
developed methods for extracting these deformation
parameters from evolving contours in an image
sequence, the image contours being manifestations
of surface texture seen in perspective projection.
Our results follow directly from the analytic
structure of the underlying image flow; no heuristics
are imposed. The deformation parameters we
seek are actually linear combinations of the Taylor
series coefficients (through second derivatives)
of the local image flow field. Thus, a by-product
of our approach is a second-order polynomial approximation
to the image flow in the neighborhood
of a contour. For curved surfaces this approximation
is only locally valid, but for planar surfaces
it is globally valid (i.e., it is exact). Our
analysis reveals an "aperture problem in the
large" in which insufficient contour structure
leaves the set of twelve deformation parameters
under-determined. We also assess the sensitivity
of our method to the simulated effects of noise in
the "normal flow" around contours, as well as the
angular field of view subtended by contours. The
sensitivity analysis is carried out in the context
of planar surfaces executing general rigid body
motions in space. Future work will address the
additional considerations relevant to curved surface
patches.
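A second-order polynomial flow model has six coefficients per velocity component, twelve in all, matching the count of deformation parameters. The sketch below fits such a model by least squares. It is a simplification of the method described above: it assumes full flow vectors are available at sample points, whereas the work reported here estimates the parameters from normal flow along contours. All function names are hypothetical.

```python
def basis(x, y):
    """Monomials up to second order: 1, x, y, x^2, xy, y^2."""
    return [1.0, x, y, x * x, x * y, y * y]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def fit_quadratic_flow(samples):
    """Least-squares fit; samples are (x, y, u, v). Returns six coefficients
    for u and six for v, via the normal equations."""
    AtA = [[0.0] * 6 for _ in range(6)]
    Atu, Atv = [0.0] * 6, [0.0] * 6
    for x, y, u, v in samples:
        phi = basis(x, y)
        for i in range(6):
            Atu[i] += phi[i] * u
            Atv[i] += phi[i] * v
            for j in range(6):
                AtA[i][j] += phi[i] * phi[j]
    return solve(AtA, Atu), solve(AtA, Atv)


# Synthetic flow generated from known coefficients on a 5x5 grid.
true_u = [0.1, 0.5, -0.2, 0.03, 0.07, -0.04]
true_v = [-0.3, 0.2, 0.4, -0.01, 0.02, 0.05]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
samples = [(x, y,
            sum(c * p for c, p in zip(true_u, basis(x, y))),
            sum(c * p for c, p in zip(true_v, basis(x, y))))
           for x in grid for y in grid]
fit_u, fit_v = fit_quadratic_flow(samples)
```

If the samples all lie along too simple a curve (for example, a single straight line), the normal equations become singular, which is a small-scale analogue of the under-determination described above.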

An Image Flow Simulator for the experimental
study of optical flow fields was developed as part
of the M.S. thesis of Mr. Sarvajit S. Sinha [5].
The purpose of the Simulator is, from a knowledge
of structure and motion, to display the 2-D image
sequence and associated flow. This 3-D graphics
animation package simulates motion of objects
through space and also the evolution of surface
contours through time. It includes graphics algorithms
for projection, clipping, hidden surface
removal, shading and animation.

Another part of Mr. Sinha's M.S. thesis deals
with "dynamic stereo" [6], a new concept in passive
ranging to moving objects which is based on the
comparison of multiple image flows. It is well
known that if a static scene is viewed by an observer
undergoing a known relative translation through
space, then the distance to objects in the scene
can be easily obtained from the measured image velocities
associated with features on the objects
(i.e., motion stereo). But in general, individual
objects are translating and rotating at unknown
rates with respect to a moving observer whose own
motion may not be accurately monitored. The net
effect is a complicated image flow field in which
absolute range information is lost. However, if a
second image flow field is produced by a camera
whose motion through space differs from that of the
first camera by a known amount, the range information
can be recovered by subtracting the first image
flow from the second. This "difference flow" must
then be corrected for the known relative rotation
between the two cameras, resulting in a divergent
relative flow from a known focus of expansion. This
passive ranging process may be termed Dynamic
Stereo, the known difference in camera motions playing
the role of the stereo baseline. We have developed
the basic theory of this ranging process,
along with some examples for simulated scenes.
Potential applications are in autonomous vehicle
navigation (with one fixed and one movable camera
mounted on the vehicle), coordinated motions between
two vehicles (each carrying one fixed camera)
for passive ranging to moving targets, and in industrial
robotics (with two cameras mounted on different
parts of a robot arm) for intercepting moving
workpieces.
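The cancellation at the heart of dynamic stereo can be checked with a few lines of arithmetic. The sketch below is an idealization rather than the theory developed in the thesis: it assumes unit focal length, two cameras that share a viewpoint and orientation at the instant considered, and a difference in camera motion that is a pure translation (so the rotation correction is already done). Function names are hypothetical.

```python
import math


def image_flow(point, rel_velocity):
    """Image velocity of a 3D point (camera frame, unit focal length)
    moving with velocity `rel_velocity` relative to the camera."""
    X, Y, Z = point
    Vx, Vy, Vz = rel_velocity
    x, y = X / Z, Y / Z
    return ((Vx - x * Vz) / Z, (Vy - y * Vz) / Z)


def range_from_difference_flow(x, y, flow1, flow2, delta_t):
    """Depth Z at image point (x, y) from two flows.

    delta_t = T2 - T1 is the known difference in camera translational
    velocity. The object's own unknown velocity cancels in the difference,
    leaving (x*dTz - dTx, y*dTz - dTy) / Z.
    """
    du, dv = flow2[0] - flow1[0], flow2[1] - flow1[1]
    nx = x * delta_t[2] - delta_t[0]
    ny = y * delta_t[2] - delta_t[1]
    return math.hypot(nx, ny) / math.hypot(du, dv)


point = (1.0, 2.0, 10.0)         # true depth Z = 10
v_obj = (0.3, -0.2, 0.5)         # unknown motion of the object point
t1, t2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
flow1 = image_flow(point, tuple(v_obj[i] - t1[i] for i in range(3)))
flow2 = image_flow(point, tuple(v_obj[i] - t2[i] for i in range(3)))
delta = tuple(t2[i] - t1[i] for i in range(3))
z = range_from_difference_flow(point[0] / point[2], point[1] / point[2],
                               flow1, flow2, delta)
```

Changing `v_obj` changes both flows but not their difference, so the recovered depth stays the same; the known `delta` plays the role of the stereo baseline.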

In collaboration with Dr. Jacqueline LeMoigne,
Dr. Waxman has also been studying the feasibility
of using projected light grids to construct range
maps of a robot's immediate environment. They
have addressed a number of operational considerations
and image processing tools relevant to this
task domain, including the issues of operating in
ambient lighting, smoothing of range texture, grid
pattern selection, albedo normalization, grid extraction
and coarse registration of image to projected
grid.

A  study  of  object  tracking  and  occlusion 
analysis  has  been  carried  out  in  connection  with 
the M.S. thesis of Mr. Nader Razor [7]. A system
was  developed  that  builds  a  map  of  the  environment 
by  tracking  moving  objects  and  detecting  instances 
of  occlusion.  This  information  is  used  to  place 
bounds  on  the  ranges  and  bearings  of  the  stationary 
objects  in  the  scene. 

3.  OTHER  TOPICS 

A  method  of  recovering  closed  boundaries  of 
two-dimensional  shapes  from  disconnected  boundary 
segments was developed as part of the Ph.D. dissertation
of Ms. Tsai-Yun Phillips, under the direction
of Dr. Takashi Matsuyama of Kyoto University.

A  geometric  labeling  scheme  was  introduced  for  the 
Voronoi  diagram  of  a  set  of  straight  line  segments 
in  the  Euclidean  plane,  and  a  method  was  developed 
of  recovering  the  medial  axis  of  a  closed  boundary 
by using the labeled Voronoi diagram [8]. Algorithms
were then developed for computing the labeled
Voronoi diagram for a set of digital line segments,
using a labeled Euclidean distance transform [9].
Finally, a digital algorithm for extracting the
digital medial axis from the labeled Voronoi diagram
was implemented [10].
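The idea of a labeled distance transform can be shown by brute force: every pixel records its distance to the nearest segment together with that segment's label, and the Voronoi boundaries lie where the label changes. This is only an illustrative sketch with hypothetical names; the algorithms cited above are not reproduced here and are presumably far more efficient.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length2 = dx * dx + dy * dy
    if length2 == 0:
        t = 0.0
    else:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length2
        t = max(0.0, min(1.0, t))  # clamp to the segment's extent
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def labeled_distance_transform(width, height, segments):
    """For each pixel, the distance to and index of the nearest segment.

    Ties are resolved in favour of the lower segment index.
    """
    dist = [[0.0] * width for _ in range(height)]
    label = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d, k = min((point_segment_distance((x, y), a, b), k)
                       for k, (a, b) in enumerate(segments))
            dist[y][x], label[y][x] = d, k
    return dist, label


# Two vertical segments at x = 0 and x = 6 on a 7x5 grid.
segments = [((0, 0), (0, 4)), ((6, 0), (6, 4))]
dist, label = labeled_distance_transform(7, 5, segments)
```

In this example the label switches from 0 to 1 between columns 3 and 4, which is where the Voronoi boundary (and, for a closed shape, the medial axis) would be found.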

A study of visibility in planar polygonal regions
was conducted in connection with the M.S.
thesis of Mr. Mark Doherty [11]. Algorithms were
developed for finding minimal sets of points from
which the entire region is visible. The relationship
between this task and that of decomposing the
region into a minimal union of star-shaped subsets
was also investigated.

The use of Hough transform methods for three-dimensional
object recognition was investigated as
part of the Ph.D. dissertation of Ms. Teresa M.
Silberberg [12]. This involved an iterative procedure
in which straight line segments in the image
are matched by finding the parameters of a viewing
transformation of a three-dimensional model consisting
of line segments. Assuming the scale of
the object is known, there are three orientation
and two translation parameters to be estimated.
Initially a sparse, regular subset of parameters
and transformations is evaluated for goodness-of-fit;
then the procedure is repeated by successively
subdividing the parameter space near current best
estimates of peaks.
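The coarse-to-fine strategy can be sketched independently of the matching details. The code below assumes a caller-supplied goodness-of-fit function `score` over a parameter vector; it samples a sparse regular grid, keeps the best sample, and resamples a smaller box around it. It follows a single peak only, whereas the procedure above tracks several; the function name, grid size, and number of rounds are hypothetical.

```python
import itertools


def coarse_to_fine(score, bounds, samples=5, rounds=6):
    """Maximise `score` over a box by repeated grid sampling and refinement.

    bounds: list of (low, high) pairs, one per parameter.
    """
    bounds = [tuple(b) for b in bounds]
    best = None
    for _ in range(rounds):
        axes = [[lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
                for lo, hi in bounds]
        best = max(itertools.product(*axes), key=score)
        # Shrink the box to one grid step either side of the best sample.
        new_bounds = []
        for p, (lo, hi) in zip(best, bounds):
            step = (hi - lo) / (samples - 1)
            new_bounds.append((max(lo, p - step), min(hi, p + step)))
        bounds = new_bounds
    return best


# Toy goodness-of-fit with a single peak at (0.3, -1.2).
def fit(params):
    a, b = params
    return -((a - 0.3) ** 2 + (b + 1.2) ** 2)


best = coarse_to_fine(fit, [(-2, 2), (-2, 2)])
```

With five parameters and five samples per axis, each round evaluates 3125 transformations, which is why refinement near the peaks is preferred over a single dense grid.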

In the area of software development, an interface
between the C and Franz Lisp languages was
implemented and documented [13]. The documentation
describes in detail how C functions may be called
from Lisp and vice versa.

4.  AUTONOMOUS  LAND  VEHICLE  NAVIGATION 

The goal of the research being conducted on
this project is to demonstrate the roles that
vision can play in navigation. Our approach to
visual navigation involves segmenting the general
navigation task into three levels, called long-range,
intermediate-range, and short-range navigation.
The general flow of control between levels
is that goals flow from levels of greater abstraction
to levels of lesser abstraction (long-intermediate-short)
and status information, concerning
the achievement of those goals, passes in the opposite
direction. Each level of navigation maintains
a map of (some subset of) the environment to be
navigated, with the map information becoming more
concrete as one moves from long down to short range
navigation. Specific sensors and visual capabilities
are also associated with each level of navigation;
these sensors and algorithms function to
maintain the correctness of the map representation
at that level.


The long-range navigator is responsible for
determining general paths through regions of uniform
visibility/navigability. Its map is a low-resolution
decomposition of the environment into
regions in which:

1) a minimal number of landmarks (either
specified in the map or acquired, dynamically,
during navigation) will be
visible from most points within the
region (uniform visibility), and

2) the terrain is of uniform composition
(uniform navigability).

The long-range navigator constructs a path,
represented as a sequence of regions, from this
map that will get the vehicle from the starting
location to its goal. This path might be edited
as the vehicle navigates through its environment
based on information obtained at lower levels of
navigation and passed up to the long-range navigator.
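A path expressed as a sequence of regions is naturally computed on a region adjacency graph. The sketch below uses breadth-first search, which returns a path with the fewest region crossings; the report does not say which search method is used, so this is an assumption, and the region names are made up.

```python
from collections import deque


def region_path(adjacency, start, goal):
    """Shortest sequence of adjacent regions from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        region = queue.popleft()
        if region == goal:
            path = []
            while region is not None:
                path.append(region)
                region = previous[region]
            return path[::-1]
        for neighbour in adjacency.get(region, ()):
            if neighbour not in previous:
                previous[neighbour] = region
                queue.append(neighbour)
    return None


regions = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = region_path(regions, "A", "E")

# Editing the path: a lower level reports that B-D is impassable, so replan.
blocked = {k: [n for n in v if {k, n} != {"B", "D"}] for k, v in regions.items()}
replanned = region_path(blocked, "A", "E")
```

Replanning after removing an edge is one simple way to model the path being edited from status information passed up by the lower levels.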

The principal visual capability of the long-range
navigator is landmark recognition. By recognizing
a sufficient number of landmarks to sufficient
accuracy, the long-range navigator can compute
the location of the vehicle to the accuracy
required for continuing the mission. The long-range
navigator is responsible for planning a
sequence of landmark recognition tasks, and then
controlling each of these recognition tasks. The
control involves first analyzing map information
to, e.g., predict the appearance of the landmark
and the background against which it will likely
appear from the vehicle's perspective. Secondly,
the camera system, which will include an electronically
controlled zoom lens and pan/tilt mechanism,
must be controlled to locate and identify the landmark
with sufficient angular resolution.

In summary, the inputs to the long range
navigator include a coarse visibility map, visual
models of landmarks, a low resolution terrain map,
and a mission description. This navigator must be
capable of object recognition and of path generation
from a region graph. Its internal data representations
include a map of regions of uniform
visibility/navigability; the location of the vehicle;
and a sequence of (paths through) regions.

The intermediate range navigator maintains a
map which is a subset of the long-range map, but
represents that subset of the environment in
greater detail based on analyses of its vision
system. Its navigation task is to compute a path
through a region of uniform visibility/navigability
by identifying what we call corridors of free
space. A corridor of free space is a straight
swath of navigable terrain that is not so densely
populated with obstacles that the vehicle could not
maneuver among them. The intermediate range navigator
constructs this sequence of corridors by initially
considering a straight line path of nominal
width, and then perturbing this path, by increasing
its width and splitting it into subsegments, to get
a minimal length path of sufficient "instantaneous"
navigability.
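One way to make the corridor notion concrete is to test a straight swath against known obstacle positions. The sketch below treats obstacles as points, measures obstacle density inside the swath, and compares it with a threshold. The density criterion, the rounded end caps implied by the distance test, and all names are assumptions for illustration; the widening and splitting steps are not implemented.

```python
import math


def distance_to_segment(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length2 = dx * dx + dy * dy
    if length2 == 0:
        t = 0.0
    else:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length2
        t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def obstacles_in_corridor(start, end, width, obstacles):
    """Obstacles lying within half the corridor width of its centre line."""
    return [o for o in obstacles
            if distance_to_segment(o, start, end) <= width / 2.0]


def corridor_is_navigable(start, end, width, obstacles, max_density):
    """True if obstacles per unit area inside the swath stay below a limit."""
    area = math.dist(start, end) * width
    inside = obstacles_in_corridor(start, end, width, obstacles)
    return len(inside) / area <= max_density


obstacles = [(5.0, 0.5), (5.0, 3.0), (2.0, -0.9)]
inside = obstacles_in_corridor((0.0, 0.0), (10.0, 0.0), 2.0, obstacles)
ok_loose = corridor_is_navigable((0.0, 0.0), (10.0, 0.0), 2.0, obstacles, 0.2)
ok_strict = corridor_is_navigable((0.0, 0.0), (10.0, 0.0), 2.0, obstacles, 0.05)
```

A planner built on this test could try the nominal straight path first and, when it fails, retry with a wider swath or with the path split at an intermediate point.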
The vision capabilities required at this
level of navigation are by far the most complex of
all three levels. The intermediate range navigator
needs to maintain a three-dimensional model
of its environment to ranges far greater than can
be explored using active ranging; furthermore, it
requires a sophisticated object recognition and
terrain analysis capability. The vision algorithms
will rely heavily on both multitemporal and stereo
analyses.

In  summary,  the  intermediate  range  navigator's 
inputs  include  a  subset  of  the  long  range  map, 
models  of  known  obstacles,  models  of  gross  terrain 
features, and a path. This navigator must be capable
of general image analysis for segmentation and
recognition;  qualitative  ranging  (e.g.,  relative 
depth  by  occlusion;  motion  stereo;  dynamic  stereo 
from  image  flows);  landmark  acquisition  (anomaly 
detection);  and  path  (corridor)  planning.  Its 
internal  data  representation  includes  a  map  of 
corridors  of  free  space,  major  obstacles  and 
terrain  features. 

The short-range navigator is tasked to navigate
the vehicle through a single corridor of
free space by identifying within that corridor a
track of safe passage. It does this based on a
combination of direct ranging and multispectral
analysis. Direct ranging, either from a time-of-flight
laser or a specially designed structured
light sensor, provides the navigator with a low
resolution depth map of the terrain. Based on
properties of the vehicle, the short range navigator
would then identify the subspace of the immediate
environment through which the vehicle can
move. The analysis of the range data will be supplemented
by a terrain composition analysis based
on multispectral data so that the navigator can
distinguish rock from vegetation from water, etc.

In  summary,  the  inputs  to  the  short  range 
navigator  include  reports  from  the  pilot  and  dead 
reckoning  subsystems,  and  a  corridor  of  free  space. 
It  must  be  capable  of  ranging  (using  a  laser 
scanner or structured light system), terrain composition
assessment (spectral/range classification),
failsafe  collision  avoidance  (e.g.,  using  acoustic 
proximity  sensors),  and  path  planning  of  specific 
tracks.  Its  internal  data  representation  includes 
a  dynamic  multi-resolution  terrain  map. 

A  specific  intermediate  range  navigation  task 
now  being  investigated  is  road  following.  A  sys¬ 
tem  for  carrying  out  this  task  will  make  extensive 
use  of  domain  specific  knowledge,  including  ground 
topography  and  3-D  road  models;  local  straightness 
and  parallelness  of  road  edges;  slow  variation  in 
road  geometry  (except  at  intersect ionsS) ;  and  dis- 
criminability  of  the  road  from  its  background 
(e.g.,  road  edges  usually  give  rise  to  intensity 
edges).  The  initial  step  in  road  following  is 
based  on  multi-resolution,  multi-perspective  edge 
detection.  From  an  initial  edge  picture,  obtained 
using  a  gradient  operator,  linear  features  are 
enhanced  using  local  support  or  anisotropic  motion 
blur.  Various  methods  are  then  used  to  select 
edges  of  interest,  including  verification  based 


on  texture  continuation;  consistency  over  scale  and 
perspective;  and  generalized  Hough  matching.  Know¬ 
ledge  about  the  road  (slowly  varying)  and  vehicle 
motion  allow  edges  to  be  predicted  in  subsequent 
views  —  in  other  words,  guided  search  can  be  used 
to  find  edges  in  views  from  new  perspectives.  The 
model and predictions can then be corrected and
updated.

The  road  following  process  also  involves  other 
intermediate  range  navigation  tasks,  such  as  plan¬ 
ning  for  anticipated  road  changes;  negotiating  in¬ 
tersections  and  curves;  getting  on  and  off  roads; 
and  (eventually)  navigating  among  other  moving 
vehicles.  It  also  involves  short  range  tasks,  in¬ 
cluding  avoidance  of  obstacles  (with  the  aid  of 
range  data)  and  constraining  the  vehicle's  path 
to  the  road  corridor;  as  well  as  long  range  tasks, 
including  recognition  of  landmarks  along  the  road 
and  road  network  navigation. 
INTRODUCTION

The  Westinghouse  Image  Processing  Group  was  headed  by  Dr.  Glenn 
Tisdale  for  nearly  20  years.  Dr.  Tisdale  retired  in  August  1984.  We  would 
like  to  take  this  opportunity  to  describe  the  major  products  produced  by 
the  group  during  this  period  as  well  as  its  future  direction. 


EARLY SYSTEMS

Work in real-time image processing began at the Westinghouse Defense
Center in 1968. The work was theoretical until 1970 when a software
simulation of a system began.

A near real-time brassboard was completed in 1974. The brassboard
filled a full rack, but could only process a 100-by-100 pixel window once
per second. The locations of recognized objects were indicated via an
LED array surrounding the display. The system used an analog storage
tube to capture the incoming video. This proved to be a major source of
noise and instability.

An advanced breadboard was completed in 1978. Its front end included
an A/D converter and a 128-by-128 digital frame buffer. It extracted
edges and blobs from the digital imagery at near real-time rates.
In this, as well as later systems, symbology could be superimposed over
the live video to cue recognized objects.

AUTO-Q SYSTEMS

Westinghouse's AUTO-Q line of processors consists of four different
models. A scene matching system (known as AUTO-MATCH) was installed
at NASA-Goddard in 1979. It is used for registering LANDSAT
image pairs.

An AUTO-Q I unit (figure 1) was built in 1980 and flight tested by the
Army at Fort A.P. Hill in 1981. It consists of an image preprocessor and
military computer, all in a 1 cubic foot box. A commercial version of this
unit is now in production. It uses Multibus size boards with an Intel
single board computer as host.

Two AUTO-Q II units have been built. One, delivered to the Air
Force, accepts linescan imagery at a 12 megapixel per second rate. The
other processes full frame video or 875-line FLIR data at a 3.3 megapixel
per second rate.

AUTO-Q is composed of a custom-designed preprocessor and a
general-purpose postprocessor. The preprocessor algorithms are hardwired,
although certain operations may be enabled, disabled, or modified under
software control. The basic tasks of the preprocessor are (1) digitizing
the analog video input, (2) storing the digitized video in a frame buffer,
(3) conditioning the digital video, and (4) extracting edge and blob
features from the imagery. The postprocessor is basically given a list of edge
vectors, blobs, and image statistics. From these, it performs some type of
scene analysis. The preprocessor works at such a high rate that the
postprocessor can frame integrate features and control the settings of the
preprocessor via feedback loops.

A block diagram of AUTO-Q II is given in figure 2. After being
digitized and stored, the video data is filtered to attenuate noise. Various
filter combinations can be chosen on a frame-by-frame basis. The
filtered image is then passed through a programmable contrast enhancer.
Figure 1. AUTO-Q Preprocessor and Postprocessor are Housed in a 1 Cubic Foot Box
[Diagram: pipelined architecture of the AUTO-Q preprocessor. Outputs are edge vectors (location, length, angle) and blob locations and sizes. All stages are under software control frame-by-frame.]

Figure 2. Block Diagram of AUTO-Q Preprocessor


One  of  three  types  of  edge  detectors  (Roberts,  Sobel,  or  Hueckel)  is  then 
applied  to  the  smoothed  and  filtered  imagery. 

Subsequent  interest  is  focused  on  those  pixels  with  high  edge  values. 
Two  types  of  feature  detectors  are  applied  in  parallel.  One  uses  a  thinning 
and linking process to construct long edge vectors. Simultaneously, a
thickening  operation  is  applied  to  the  edge  image.  Blobs  are  then  ex¬ 
tracted  from  the  thickened  edge  image  by  a  border  following  operation. 
Up  to  2000  edge  vectors  of  various  lengths  and  256  blobs  can  be  extracted 
for  each  processed  frame.  The  processing  rates  of  the  different  AUTO-Q 
models  are  given  in  figure  3. 
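
As a rough illustration of the edge-detection stage described above, the sketch below applies the Roberts and Sobel operators to a small image and keeps the pixels with high edge values. It is a software analogy only, not the hardwired AUTO-Q implementation: the function names, the |gx| + |gy| magnitude approximation, and the threshold value are assumptions made for this example, and the Hueckel operator is omitted.

```python
def roberts(img, r, c):
    """Roberts cross magnitude for the 2x2 window whose top-left pixel is (r, c)."""
    gx = img[r][c] - img[r + 1][c + 1]
    gy = img[r][c + 1] - img[r + 1][c]
    return abs(gx) + abs(gy)


def sobel(img, r, c):
    """Sobel magnitude (|gx| + |gy|, an assumed approximation) at interior pixel (r, c)."""
    gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]) \
        - (img[r - 1][c - 1] + 2 * img[r][c - 1] + img[r + 1][c - 1])
    gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]) \
        - (img[r - 1][c - 1] + 2 * img[r - 1][c] + img[r - 1][c + 1])
    return abs(gx) + abs(gy)


def edge_map(img, threshold):
    """Mark interior pixels whose Sobel magnitude reaches the (hypothetical) threshold."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if sobel(img, r, c) >= threshold:
                out[r][c] = 1
    return out


# A 5x5 test image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 10, 10, 10] for _ in range(5)]
edges = edge_map(step, threshold=20)
```

The thinning, linking, thickening, and border-following steps would then operate on a map such as `edges`.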

VHSIC  SYSTEMS 

The future of image processing at Westinghouse lies with submicron
VHSIC  boards  -  both  specially  designed  image  processors  and  generic 
signal  processors.  Multispectral,  multisensor  image  processors  will  be  the 
norm  in  the  1990’s.  Meanwhile,  older  models  will  continue  to  go  into 
commercial  production  as  part  of  the  Defense  Center's  Technology 


Transfer  Program.  The  following  information  is  taken  from  an  Aviation 
Week  article  of  July  30,  1984. 

Westinghouse's Phase 1 VHSIC chip set consists of five devices which
serve as the building blocks for hierarchical multiprocessor systems. The
five chips are Pipeline Arithmetic Unit, Extended Arithmetic Unit,
Enhanced Arithmetic Unit Multiplier, General Purpose Controller, and 64K
Static RAM. The Westinghouse team is the only one of the six Phase 1
contractors to use Complementary Metal-Oxide Semiconductor (CMOS)
technology for all members of its chip set - a technology that is the
leading contender for the follow-on Phase 2 effort. One VHSIC brassboard
being built under the Phase 1 contract is an electro-optical image
processor for the M1 tank.

There  is  growing  confidence  that  the  much  more  challenging  objectives 
of  Phase  2  calling  for  0.5  micron  feature  size  and  a  further  100-fold 
increase  in  throughput  can  be  achieved  in  the  VHSIC  Phase  2  effort.  It  is 
our  belief  that  processing  rates  will  be  sufficient  for  real-time  image  un¬ 
derstanding  before  the  decade  is  over. 
Name        Technology     Year   Number     Ops/s         512 x 512   Customers
                                  of Units                 Frames/s

AUTO-Q I    MSI TTL        1980   45         400 Million   3           Army, AF, NASA,
                                                                       MIT, SU, DEC, etc.
AUTO-Q II   MSI TTL        1983   2          1 Billion     13          AF, Army
VHSIC Φ1    1.25 µm CMOS   1985   *          *             *           Army
VHSIC Φ2    0.5 µm CMOS    1988   *          *             *           Proposal Submitted

*Stars Denote Information Cannot Be Published at This Time

Figure 3. Vital Statistics of Westinghouse's Image Processors
Abstract

We report here on several implementations of
computing architectures for use in array-based 2-D
image processing applications. First we will
describe our linear systolic array, which has been
built using a custom designed VLSI Multiplication
Oriented Processor (MOP) chip as the processing
element. It is capable of performing DFTs, 1-D and
2-D convolutions, and solving Toeplitz linear
systems. A second, more general systolic array we
are building is based on the Faddeev algorithm.
This machine, which is ultimately intended to be a
general linear algebraic processor, is presently
capable of matrix operations, linear system
solution, matrix factorization, and least squares
solutions. Finally, we will briefly describe our
3-D computer, which combines wafer scale
integration, bus lines through wafers, and
inter-wafer connections, to provide a computing
capability with several orders of magnitude
combined improvement in power dissipation,
throughput, size and cost, when compared to present
special purpose computers.

Introduction 

In  this  paper  we  describe  three  computing 
architectures,  which  vary  from  very  special  purpose 
with  a  relatively  simple  implementation  to  more 
general  purpose  (in  terms  of  image  understanding) 
with a complex implementation. In Section I we
discuss our linear systolic array and give an
example of how it can be used to solve a special
type of linear system in O(n) time steps using only
O(n) processing elements (PEs). In Section II we
describe a 2-D array of PEs capable of much more
general operations in O(n) time steps using an
O(n^2) array of PEs. Most of the discussion will
center on the Faddeev algorithm, since it is the
architectural basis of this array. Finally, we
will briefly describe our 3-D computer and
summarize its capabilities in Section III.

I. Linear Systolic Array

The  linear  systolic  array  architecture  has  been 
recognized  as  a  relatively  inexpensive  approach  to 
providing  increased  throughput  in  a  number  of 
important  application  areas.  [1]  Such  arrays  are 
easily  expandable,  relatively  simple  to  design  and 
interface,  while  offering  a  surprisingly  large 
range  of  calculational  capabilities.  We  describe 
here  a  design  that  uses  a  custom  VLSI  chip  as  a  PE 


which  has  been  built  in  an  optimum  way  for  linear 
systolic  arrays.  In  addition  we  have  integrated  on 
to  the  chip  a  special  high  speed  multiplier  [2] 
which  is  far  more  area-time  efficient  than 
conventional  fast  parallel  multipliers. 

Architecture 

Our  prototype  system,  shown  in  Figure  1, 
consists  of  nine  replicated  copies  of  our  custom 
MOP  chip,  plus  a  commercial  divider.  (Each  MOP 
chip  serves  as  a  PE.)  Eight  of  these  MOP  chips  are 
used in the systolic array and one (MOP AUX) serves
as a fast buffer memory for the rest of the array.
Since  many  algorithms  for  linear  systolic  arrays 
result  in  only  alternate  processors  active  at  any 
time,  we  effectively  time  multiplex  operations  so 
that  all  processors  can  be  active  all  the  time. 
For  example  the  eight  MOP  chips  can  be  used  to 
solve  a  16th  order  Toeplitz  linear  system. 


Figure 1. System block diagram of linear systolic
array. The inputs C_i to the MOP chips
are used to gate control signals to each
chip.


Instructions  are  broadcast  to  all  PEs  by  our 
control  unit  (Tektronix  9163  system).  Each  PE  has 
a  gate  associated  with  it  to  control  receipt  of 
instructions. In some "wavefront" type algorithm
implementations this is necessary to prevent PEs
which are not part of the wavefront from receiving
instructions. As can be seen in Figure 2, there is
one  PE  per  board.  All  the  boards  are  plugged  into 
alternate  slots  in  a  VME  bus.  This  organization 
provides  simple  diagnostic  means  for  debugging  the 
operation  of  this  prototype  linear  array. 


Figure  2.  Photograph  of  linear  systolic  array. 

Each  board  is  a  PE  (one  MOP  chip).  Most 
of  the  TTL  circuitry  is  used  to 
implement  and  interface  the  divide 
function, which is performed by a
commercial chip.


We have written an assembler along with code to
download programs from our 11/34 host computer.
There is presently an RS-232 link to our host
computer, which will be replaced by a more direct
link in the near future.

Multiplication  Oriented  Processor  Chip 

Considering  the  large  number  of  applications  of 
systolic  type  processor  arrays,  it  would  be 
difficult  to  design  a  single,  generic  custom  chip 
containing  the  entire  spectrum  of  desirable 
attributes  without  significant  sacrifices  in 
performance  and  chip  area.  As  an  alternative 
approach,  we  have  used  a  design  methodology  which 
offers  a  generic  type  of  chip  that  can  be  easily 
designed  to  fit  a  specific  application  or  range  of 
applications.  This  generic  capability  is  a  result 
of  a  number  of  features  associated  with  the  way  the 
chip  is  organized.  In  this  section  we  discuss  the 
MOP chip as an example of such a generic approach
to  providing  PEs  for  systolic  arrays. 

In  order  to  provide  the  necessary  flexibility 
for  data  movement  within  and  between  chips,  we 
selected  a  bus  oriented  and  bit  slice  architecture. 
Processor  arrays  generally  need  large  data 
bandwidths  and  flexibility  in  I/O  capabilities.  To 
support  these  needs  we  have  provided  the  MOP  with 
two  bidirectional,  tri-state,  parallel  I/O  ports. 
Each  of  these  has  two  output  latches  and  two  input 
latches  and  can  send  or  receive  two  words  per  clock 
cycle. 


The  speed  of  the  MOP  chip  has  been  enhanced  by  a 
number  of  features.  First,  it  uses  dual  buses  so 
that  two  operands  can  be  transferred  per  clock 
cycle.  All  arithmetic  algorithms  have  the 
necessary  control  embedded  in  hardware.  This  adds 
considerable  speed  over  a  microcoded  approach  and 
eases  the  controller  design  problem.  The 
multiplier  (and  future  divide  and  square  root) 
circuitry  runs  on  its  own  set  of  high  speed  clocks 
that  are  separate,  but  in  synchronism  with,  the 
slower  system  clocks.  The  multiplier  uses  a  radix-4 
algorithm  for  an  additional  factor  of  two  speed-up 
over  conventional  algorithms  and,  finally,  has  a 
three  stage  pipeline  to  allow  very  high  partial 
product  accumulation  speeds. 
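
To make the factor-of-two claim concrete, the sketch below shows one common radix-4 scheme, Booth recoding, in which the multiplier is consumed two bits at a time so that an n-bit multiply needs n/2 partial products instead of n. The source does not say which radix-4 algorithm the MOP multiplier uses, so Booth recoding, the function names, and the 16-bit default width are all assumptions for illustration.

```python
def booth_radix4_digits(b, nbits):
    """Recode an nbits-wide two's-complement multiplier into radix-4 digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digits are returned least
    significant first, one digit per pair of multiplier bits.
    """
    assert nbits % 2 == 0
    assert -(1 << (nbits - 1)) <= b < (1 << (nbits - 1))
    u = b & ((1 << nbits) - 1)      # two's-complement bit pattern of b
    digits = []
    prev = 0                        # implicit bit to the right of bit 0
    for i in range(0, nbits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def multiply_radix4(a, b, nbits=16):
    """Multiply a by b by accumulating one partial product per radix-4 digit."""
    acc = 0
    for i, d in enumerate(booth_radix4_digits(b, nbits)):
        acc += (d * a) << (2 * i)   # d * a is one of 0, +/-a, +/-2a
    return acc
```

For a 16-bit multiplier the loop runs 8 times rather than 16, which is the source of the speed-up; in hardware each partial product is only a shift and an optional negation of the multiplicand.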

The  MOP  chip,  shown  in  Figure  3,  also  has  10 
general  purpose,  dual  port,  28-bit  registers,  two 
16-word deep LIFO stacks, and a Manchester-type
ripple adder. The chip size is 290 x 250 mils, and
power  consumption  is  0.75  to  2W,  depending  on 
processing  parameters.  The  system  clocks  were 
designed  for  4MHz  operation  and  the  high  speed 
multiplier  clocks  are  intended  to  be  run  at  16MHz. 


Figure  3.  Photograph  of  MOP  chip  showing  important 
regions. 


Systolic  Implementation  of 
Toeplitz  System  Solution 

We  have  looked  at  a  number  of  algorithms  that 
can  be  implemented  on  a  linear  systolic  array 
(e.g.,  splines,  1-D  and  2-D  convolutions).  We 
describe  here  briefly  one  of  these:  a  Toeplitz 
linear  system, 

$$H X = Y \qquad (1)$$

where H is a Toeplitz matrix. This system can be
solved by using the Levinson algorithm [3,4]

$$X = U^{-1} D \, [U^{t}]^{-1} Y$$

where U is an upper triangular matrix and D a
diagonal matrix with elements $u_{11}, u_{22}, \ldots, u_{nn}$.
To perform this operation on our systolic array the
following successive steps are necessary:

Generation of $U^{t}$

First Back Substitution $A = [U^{t}]^{-1} Y$

Second Back Substitution $X = U^{-1} D A$.

Figure 4. Diagram of linear systolic array system (top) and functional layout of MOP chip (bottom).
The lattice array generates the elements of U. Switches S1 are closed and S2 open
during the lattice array and first backsubstitution operations. During the second
backsubstitution operation S2 is closed and S1 is open. Calculations performed by the
processors are indicated in their respective boxes.

The  essential  arithmetic  elements  in  this 
process  are  high  speed  multiplication  and  a  LIFO 
data  reflection  to  perform  the  transpose.  The 
linear  array  configuration  for  this  is  shown  in 
Figure 4. This array provides O(n) speed using
O(n) PEs. A complete solution to a 16th order
Toeplitz system requires approximately 175 µsec.
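
The three steps (generate the triangular factor, first back substitution, second back substitution) can be sketched in software as follows. This is an illustration under stated assumptions, not the lattice/Levinson recursion run on the array: it assumes H is symmetric with non-zero leading minors and factors it as H = U^t D^{-1} U with D = diag(U) using a generic O(n^3) elimination, and all function names are hypothetical. Exact rational arithmetic is used so the result can be checked by hand.

```python
from fractions import Fraction


def toeplitz(r):
    """Symmetric Toeplitz matrix whose first row is r."""
    n = len(r)
    return [[Fraction(r[abs(i - j)]) for j in range(n)] for i in range(n)]


def factor_u(H):
    """Upper triangular U with H = U^t D^{-1} U, where D = diag(U)."""
    n = len(H)
    U = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = sum(U[k][i] * U[k][j] / U[k][k] for k in range(i))
            U[i][j] = H[i][j] - s
    return U


def solve_toeplitz(r, y):
    """Solve H x = y via x = U^{-1} D [U^t]^{-1} y."""
    H = toeplitz(r)
    n = len(H)
    U = factor_u(H)
    # First substitution: solve U^t a = y (U^t is lower triangular).
    a = [Fraction(0)] * n
    for i in range(n):
        s = sum(U[k][i] * a[k] for k in range(i))
        a[i] = (Fraction(y[i]) - s) / U[i][i]
    # Scale by the diagonal matrix D.
    z = [U[i][i] * a[i] for i in range(n)]
    # Second substitution: solve U x = z.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (z[i] - s) / U[i][i]
    return x
```

For example, `solve_toeplitz([4, 2, 1], [11, 16, 17])` solves the 3rd order system whose matrix has first row (4, 2, 1). The systolic version obtains the same factor with O(n) PEs by exploiting the Toeplitz structure, which this generic factorization does not do.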

II.  Linear  Algebraic  Processor 

One  approach  to  image  analysis  has  been  to 
relate  such  problems  to  specific  concepts  in  linear 
algebra  and  matrix  theory.  [5]  Therefore,  a 
concurrent  processor  capable  of  performing  a 
general  class  of  linear  algebraic  operations  would 


be  of  great  value  here.  However,  one  of  the 
principal  problems  encountered  in  designing 
concurrent  computing  systems  is  that  of  providing  a 
sufficiently  general  range  of  capabilities  without 
undue  addition  of  hardware  and  interconnection 
requirements.  What  we  describe  here  is  an 
algorithm  based  on  that  of  Faddeev  [6,7]  which 
offers  a  systematized  means  for  performing  a 
variety  of  matrix  operations.  It  can  be  adapted  to 
use  on  existing  2-D  concurrent  computing  arrays  or 
as the basis for a computing architecture in
itself.

We  will  first  describe  the  Faddeev  algorithm, 
highlighting  its  advantages  and  deficiencies  for 
performing  matrix  computations.  We  will  then  show 
how  it  can  be  modified  to  provide  a  wider  range  of 
capabilities  with  improved  numerical  stability. 
Finally  we  will  discuss  its  application  to  an 
important  set  of  problems  (least  squares). 

Faddeev  Algorithm 

To  illustrate  the  Faddeev  algorithm  we  consider 
the  simple  case  of  finding 
$$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n + d,$$

where $c_1, c_2, \ldots, c_n$ and $d$ are given numbers, and
$x_1, x_2, \ldots, x_n$ is the solution to the linear system
of equations

$$\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= b_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &= b_2 \\
&\;\;\vdots \\
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n &= b_n
\end{aligned}$$

whose determinant is non-zero. The problem can be
codified by writing it as

$$\left[\begin{array}{cccc|c}
a_{11} & a_{12} & \cdots & a_{1n} & b_1 \\
a_{21} & a_{22} & \cdots & a_{2n} & b_2 \\
\vdots & \vdots &        & \vdots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn} & b_n \\
\hline
-c_1 & -c_2 & \cdots & -c_n & d
\end{array}\right]$$

or in abbreviated form,

$$\left[\begin{array}{c|c} A & B \\ \hline -C & d \end{array}\right],$$

where B is a column vector and C is a row vector.
If a suitable linear combination of the rows above
the line (from A and B) is added to the row
beneath the line (e.g., -C+WA and d+WB, where W
specifies the appropriate linear combination), so
that only zeroes appear in the lower left hand
quadrant, then the desired result, CX+d, will
appear in the lower right hand quadrant. This follows
because the annulment of the lower left hand
quadrant requires that

$$W = C A^{-1}$$

so that

$$d + WB = d + C A^{-1} B.$$

Since $X = A^{-1} B$, we have the final result

$$d + WB = d + CX.$$

The simplicity of the algorithm is due to the
absence of a necessity to actually identify
multipliers of the rows of A and the elements of B;
it is only necessary to "annul the last row." This
can be done by ordinary Gaussian elimination. An
important feature of this algorithm is that it
avoids the usual backsubstitution or solution to
the triangular linear system and obtains the values
of the unknowns directly at the end of the forward
course of the computation, resulting in a
considerable savings in added processing and
storage. Studies we have done show
that numerical accuracy is comparable to the usual
LU decomposition and backsubstitution.


This result can be generalized to the case of
rectangular matrices C, D, and B, or

$$\left[\begin{array}{c|c} A & B \\ \hline -C & D \end{array}\right].$$

After the lower left hand quadrant is annulled, the
result $CA^{-1}B + D$ will appear in the lower right hand
quadrant. As shown in Figure 5, numerous matrix
operations are possible by selective entries in the
four quadrants.
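
A minimal software sketch of the Faddeev algorithm follows, assuming A is square and non-singular. It builds the compound array, annuls the lower left quadrant by ordinary Gaussian elimination (row exchanges are confined to the rows of A), and reads the result from the lower right quadrant. The function name and the use of exact rational arithmetic are choices made for this illustration only.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return C A^{-1} B + D by annulling the lower left block of [[A, B], [-C, D]]."""
    n = len(A)
    top = [[Fraction(v) for v in ra + rb] for ra, rb in zip(A, B)]
    bottom = [[Fraction(-v) for v in rc] + [Fraction(v) for v in rd]
              for rc, rd in zip(C, D)]
    M = top + bottom
    for k in range(n):
        # Choose a non-zero pivot among the remaining rows of A only.
        p = next(r for r in range(k, n) if M[r][k] != 0)
        M[k], M[p] = M[p], M[k]
        # Subtract multiples of the pivot row from every row below it,
        # including the rows beneath the line.
        for r in range(k + 1, len(M)):
            f = M[r][k] / M[k][k]
            if f:
                M[r] = [x - f * y for x, y in zip(M[r], M[k])]
    # The lower left block is now zero; the lower right block is the answer.
    return [row[n:] for row in M[n:]]


A = [[2, 1], [1, 3]]
I2 = [[1, 0], [0, 1]]
Z2 = [[0, 0], [0, 0]]
x = faddeev(A, [[1], [2]], I2, [[0], [0]])   # solves A x = (1, 2)^t
A_inv = faddeev(A, I2, I2, Z2)               # matrix inverse
```

Note that no back substitution appears anywhere: the solution emerges at the end of the forward elimination, which is the property emphasized above.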

POSSIBLE OPERATIONS

    Entries                Result in lower right quadrant

    [ A  I ; -I  0 ]       $A^{-1}$
    [ I  B ; -C  0 ]       $CB$
    [ I  B ; -C  D ]       $D + CB$
    [ A  B ; -I  0 ]       $A^{-1}B$
    [ A  B ; -C  D ]       $CA^{-1}B + D$

Figure 5. Illustration of possible matrix
operations using Faddeev algorithm.


Modified  Faddeev  Algorithm 


Although the Faddeev algorithm has some very
desirable features, we would like to add an
orthogonal factorization capability for added
numerical stability and to permit the coefficient
matrix to be non-square for over- and
underdetermined systems of equations.
Unfortunately, when Givens rotations are applied to
the matrix

$$\left[\begin{array}{c|c} A & B \\ \hline -C & D \end{array}\right]$$

in the usual way (beginning in the lower left hand
corner) to annul C, the result is

$$\left[\begin{array}{c|c} R & Q_1 B \\ 0 & Q_2 B \\ \hline -YC + WA & YD + WB \end{array}\right]$$

where A is m x n, R is upper triangular, Q is
orthogonal, and Y is a matrix resulting from
rotations on elements of C. Since the lower left
hand quadrant results in $W = YCA^{-1}$, we find that
the lower right hand quadrant becomes Y(D+CX).
Therefore, the mixing of the rows beneath the line
during the rotation process causes the incorrect
result to appear. For this reason it is necessary
to divide the process of annulling the lower left
hand quadrant into a two step procedure. First A
is triangularized by Givens rotations
(simultaneously applied to B); after this is
completed the remainder of the process can be
accomplished by Gaussian elimination using the
diagonal elements of R, all of which must be
non-zero if A is full rank, as pivot elements. In
other words after the first step (Givens rotations)
we obtain

$$\left[\begin{array}{c|c} R & Q_1 B \\ 0 & Q_2 B \\ \hline -C & D \end{array}\right]$$

and after the second step, Gaussian elimination,
the final result is

$$\left[\begin{array}{c|c} R & Q_1 B \\ 0 & Q_2 B \\ \hline 0 & C R^{-1} Q_1 B + D \end{array}\right].$$

There  are  no  restrictions  here  on  the  coefficient 
matrix  A  other  than  it  be  full  rank.  Thus,  as  will 
be  seen  later,  least  squares  problems  can  be  solved 
in  addition  to  the  matrix  manipulation  capabilities 
shown  in  Figure  5. 
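
The two step procedure can be sketched in floating point as below. This is an illustrative sequential version, not the array implementation: the function name is hypothetical, A is assumed to be m x n with m >= n and full rank, and rotations are applied row pair by row pair rather than in the pipelined order a PE array would use.

```python
import math


def modified_faddeev(A, B, C, D):
    """Return C R^{-1} Q1 B + D using Givens rotations, then Gaussian elimination."""
    m, n = len(A), len(A[0])
    top = [[float(v) for v in ra + rb] for ra, rb in zip(A, B)]
    # Step 1: triangularize A with Givens rotations, applied to B as well.
    for k in range(n):
        for r in range(k + 1, m):
            a, b = top[k][k], top[r][k]
            h = math.hypot(a, b)
            if h == 0.0:
                continue
            c, s = a / h, b / h
            rk, rr = top[k], top[r]
            top[k] = [c * x + s * y for x, y in zip(rk, rr)]
            top[r] = [-s * x + c * y for x, y in zip(rk, rr)]
    # Step 2: annul -C by Gaussian elimination, pivoting on the diagonal of R.
    bottom = [[-float(v) for v in rc] + [float(v) for v in rd]
              for rc, rd in zip(C, D)]
    for k in range(n):
        for row in bottom:
            f = row[k] / top[k][k]
            for j in range(k, len(row)):
                row[j] -= f * top[k][j]
    return [row[n:] for row in bottom]


# Overdetermined least squares: entries [A b; -I 0] give the minimizing x.
A = [[1, 0], [0, 1], [1, 1]]
b = [[1], [2], [4]]
x = modified_faddeev(A, b, [[1, 0], [0, 1]], [[0], [0]])
```

Because the rows beneath the line are never rotated, they are only ever modified by adding multiples of rows of R, which is what keeps the lower right quadrant equal to the desired result.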

Architectural  Considerations 

As  mentioned  in  the  description  of  the  Faddeev 
algorithm,  one  of  its  nice  properties  is  that  it 
maps  well  into  an  array  architecture  with  data 
flowing  in  nearest  neighbor  fashion.  One  approach 
would  be  to  use  a  triangular  array  [8]  for  this 
purpose.  In  order  to  correctly  process  the  B 
matrix  it  is  only  necessary  to  extend  the 
triangular  matrix  in  the  eastward  direction  as 
shown  in  Figure  6.  By  passing  A  and  B  down  through 
this array with delays as shown, and performing the
computations indicated in the circles and boxes, R
and $Q_1 B$ will be left stored in the array of PEs.
(In this example both A and B are m x 4 in size.)

Figure 6. Processor flow diagram and PE
arrangement for first step (orthogonal
triangularization) of modified Faddeev
algorithm.

The  second  step  in  the  modified  Faddeev 
algorithm  could  be  accomplished  as  shown  in 
Figure  7.  Here  C  and  D,  each  of  which  is  nx4,  are 
also  passed  down  through  the  array  of  processing 
elements  in  a  similar  way.  In  this  case  the  set  of 
operations  performed  in  each  PE  is  slightly 
different  as  shown.  The  PEs  indicated  by  the 
circles  each  zero  one  column  of  C  by  pivoting  on 
the diagonal elements of R. In this case after
O(n) time steps the result $CR^{-1}Q_1B+D$ will appear
row by row coming out of the array at the bottom
right.

The  triangular  structure  of  Figures  6  and  7  can 
be  easily  transformed  to  a  square  organization  for 
more  efficient  implementation  and  generalization  to 
handle  arbitrary  sized  matrices. 


Least  Squares  Problem 


As  an  example  of  the  use  of  the  Faddeev 
algorithm  we  will  show  how  it  can  be  applied  to  the 
least  squares  problem  [9]  of  finding  the  value  of  x 
that  minimizes 


$$\|Ax - b\|.$$

The usual procedure is to perform an orthogonal
triangularization of the m x n (m>n) matrix A, which
for the overdetermined case leads to

$$Q^{t} A = \begin{bmatrix} R \\ 0 \end{bmatrix}$$

so that

$$\|Ax - b\| = \|Q^{t}Ax - Q^{t}b\|$$

$$\|Ax - b\|^{2} = \|Rx - Q_1^{t} b\|^{2} + \|Q_2^{t} b\|^{2}.$$

The minimum value of $\|Ax - b\|$ is obtained with x as the
solution to $Rx = Q_1^{t} b$. The residual is then $Q_2^{t} b$.
These results can be found using the modified
Faddeev algorithm with the data arrangement

$$\left[\begin{array}{c|c} A & b \\ \hline -I & 0 \end{array}\right].$$

For example the processor arrangement shown in
Figures 6 and 7 would be suitable for computing the
least squares solution to $\min \|Ax_i - b_i\|$ for
i=1,2,3,4, where A is m x 4. Note, as shown in
Figure 6, the residuals are very simply obtained by
accumulation of the squares of the first m-4
non-zero outputs (corresponding to the elements of
$Q_2^{t} b$) of each of the columns associated with $b_i$.

Figure 7. Processor flow diagram and PE
arrangement for second step (Gaussian
elimination). The output matrix is
$G = CR^{-1}Q_1B + D$.

For the underdetermined system, where A is m x n
with m<n, we find after factoring that

$$\|Ax - b\| = \|[R : 0]\,Q^{t}x - b\|.$$

If we let

$$Q^{t} x = y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

then $y_1$ can be found from the solution to the
triangular system

$$R y_1 = b$$

and $y_2$ is arbitrary. The usual procedure is to set
$y_2 = 0$, which corresponds to taking the minimum norm
solution. The underdetermined case requires that Q
be applied after the solution for $y_1$, so that the
rotations must be accumulated during the course of
the computation. This problem can be solved using
the modified Faddeev algorithm with the data
entries

$$\left[\begin{array}{c|c} A^{t} & I \\ \hline -b^{t} & 0 \end{array}\right].$$

At the end of the calculation the entries will be

$$\left[\begin{array}{c|c} [R : 0]^{t} & Q^{t} \\ \hline 0 & x^{t} \end{array}\right]$$

where the desired result $x^{t}$ is in the lower right
hand quadrant. Of course multiple underdetermined
least squares calculations, $\min \|Ax_i - b_i\|$, for
i=1,2,...,n, could be performed with the entries

$$\left[\begin{array}{c|c} A^{t} & I \\ \hline -B^{t} & 0 \end{array}\right]$$

where $B = [b_1 : \cdots : b_n]$ contains the set of right
hand side vectors. From an architectural point of
view no extra PEs are necessary when processing
more than one right hand side vector.

III. 3-D Computer

The 3-D computer concept consists of a large
number ($10^4$ to $10^6$) of parallel computing
channels [10]. Typically the number of channels is
equal to the number of pixels in the information
plane, allowing the assignment of a complete
processor to each pixel. This organization is in
direct correspondence with (and is therefore more
efficient for processing) 2-D arrayed data.
However, it should be emphasized that the
application of such a machine is not limited to
two-dimensional data, and a very wide variety of
computationally intensive applications can be
performed by the structure with considerable
advantage.

Figure 8. Basic structure of the 3-D computer.

The  3-D  parallel  processor  structure  can  be 
visualized  as  a  stack  of  large  silicon  wafers  lying 
on  top  of  each  other  like  a  stack  of  coins 
(Figures  8  and  9).  Each  wafer  is  divided  into  an 
N  x  N  array  of  primitive  computing  cells.  The 
signal  is  transferred  through  the  wafer  at  each 
cell  and  then  interconnected  to  the  corresponding 
cell  on  the  adjacent  wafer.  In  this  way,  N  x  N 
signal  paths  (bus  lines)  penetrate  the  stack  of 
wafers.  Each  of  these  N  x  N  data  lines  serves  as 
the  main  data  path  of  a  primitive  serial  computer. 


Figure  9.  Organization  of  3-D  computer  as  shown 
from  "side"  view. 


Functional  units  of  each  computer  are  arranged 
along  these  serial  data  buses.  Each  wafer  in  the 
stack  contains  an  N  x  N  array  of  computing  elements 
of  one  type  (such  as  memories,  accumulators,  as 
shown  in  Table  1),  one  such  element  for  each  of  the 
N  x  N  data  lines.  The  idea  is  to  put  a  stack  of 
these  primitive  computing  elements  behind  each 
pixel  or  matrix  element,  thereby  providing  a  simple 
computer for each element of the incoming data
structure.


Table 1. List of Cell Types in the 3-D Computer.

CELL TYPE      FUNCTION

Memory         Store, shift, invert/non invert,
               "OR," full word/MSB only,
               destructive/non destructive
               read out

Accumulator    Store, add, full word/MSB only,
               destructive/non destructive
               read-out

Replicator     I/O, X/XY short, stack/plane
               control unit communication

Counter        Count in/shift out

Comparator     Store (reference),
               greater/equal/lower


range-Doppler computation, spotlight SAR, and
matrix computations such as matrix inversion and
multiplication. In Table 2 we summarize
computation times for some primitive operations.


Table 2. Processing times for various primitive
operations (10-MHz clock)

OPERATION                           TIME

Data move (MEM -> MEM)                1.8 µs
ADD (ACC + MEM -> MEM)                1.8 µs
MULTIPLY (ACC x MEM -> MEM)          42.2 µs
DIVIDE (ACC : MEM -> ACC)           127.1 µs
SQUARE ROOT (√ACC -> ACC)           152.6 µs
Sobel edge operator                  54.3 µs
256 x 256 matrix multiply            12.0 ms
256 x 256, 8-bit histogram 2          1.7 ms
256 x 256 matrix inversion           10.2 ms

Due  to  the  enormous  number  of  processing 
elements  that  must  be  contained  on  a  single  silicon 
wafer,  the  actual  area  available  to  each  such 
element  is  quite  small,  presently  on  the  order  of 
20  mils  x  20  mils.  This  places  a  strict  limitation 
on  the  number  of  components  available  with  which  to 
construct  the  elements.  This  restriction  is 
overcome  by  dispersing  the  various  functions  of 
each  computer  vertically  throughout  a  stack  of 
wafers.  In  a  typical  example,  the  array  may 
consist  of  256  rows  and  256  columns,  for  a  total  of 
65,536  identical  computers.  Each  of  these 
computers  would  have  its  functions  distributed  over 
a vertical column extending through 20 or more
wafers.


Our  approach  here  for  a  prototype  machine  is  to 
provide  a  full  2:1  redundancy  on  every  wafer  plane. 
In  this  case,  there  will  be  two  computing  cells, 
four  feedthroughs,  and  four  interwafer  contacts  for 
each  pixel  element  in  the  image.  Assuming  0.99 
yield  on  a  12  x  12  mil  unit  cell,  we  have  computed 
the  yield  on  VLSI  arrays  of  different  size  both 
with  and  without  2:1  redundancy  and  disconnect-type 
repairs.  The  result  is  shown  in  Table  3.  Without 
the  redundancy,  the  yield  on  a  450  x  450  mil  array 
would  be  practically  zero.  But  with  2:1  redundancy 
and  disconnect-type  repairs,  the  yield  stays 
reasonable  up  to  a  1-in.  chip.  The  assumption  in 
this  case  is  that,  if  both  cells  on  any  pixel  are 
defective,  the  chip  will  be  thrown  away.  If  one  is 
willing  to  make  10  or  fewer  discrete  wirings  on  a 
wafer,  which  means  being  willing  to  repair  a  wafer 
with  10  or  fewer  defective  cell  pairs,  one  can 
substitute  a  good  working  cell  from  a  next  neighbor 
pixel  that  has  both  cells  good  for  the  two 
defective cells. In this case, a 93% yield can be
achieved on a 3.6-in. array.
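
As a rough check on the no-redundancy column of Table 3, the yield of an array without spare cells can be sketched as the product of the individual cell yields. The sketch below assumes independent cell failures and assumes that a 225 x 225 mil array holds 16 x 16 cells; that cell count is not stated in the text and is chosen here only because it reproduces the tabulated figure. The function name is hypothetical.

```python
def array_yield_no_redundancy(cells_per_side: int, cell_yield: float = 0.99) -> float:
    """Yield of a square array in which every single cell must work."""
    return cell_yield ** (cells_per_side ** 2)


# Assumed cell counts per side for the three smallest array sizes (mils).
for side_mils, cells in [(225, 16), (450, 32), (900, 64)]:
    print(side_mils, array_yield_no_redundancy(cells))
```

Under these assumptions the 225-mil and 450-mil arrays come out near 0.076 and 0.00003, which is why redundancy is needed for anything larger. The redundancy columns depend on repair details that the text does not give, so they are not modeled here.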


In the usual mode of operation, only two of the
wafers in the stack will be simultaneously active,
one functioning as the source of the data being
processed, the other performing the processing and
acting as the repository of the results of that
computation. Although serial arithmetic means that
the individual cellular computers are fairly slow,
the massive parallelism of the array results in an
enormous overall processing power. Benefits of the
serial arithmetic include the simplicity of the
processing elements and the fact that the
processing occurs simultaneously with the memory
access. As a result, the memory bandwidth of the
3-D computer is always matched to the processor
bandwidth.


Table 3. VLSI yield comparison.

ARRAY SIZE      NO            2:1 REDUNDANCY    2:1 REDUNDANCY AND
(MILS)          REDUNDANCY    & DISCONNECT      10 SIMPLE DISCONNECT-
                              REPAIR ONLY       CONNECT REPAIRS

 225 x  225     0.076         0.957             1
 450 x  450     0.00003       0.903             1
 900 x  900     0             0.664             1
1800 x 1800     0             0.194             1
3600 x 3600     0             0.0014            0.93

Numerous  processing  algorithms  have  been 
simulated  on  the  3-D  architecture,  including  object 
identification  and  analysis,  cueing  routines,  radar 
The projected cost of the 3-D computer is low.
Most  of  the  cost  for  present  computers  is  for 
building  the  computer:  printed  boards,  wiring,  etc. 
In  the  3-D  computer,  most  of  these  costs  are 
eliminated.  It  is  not  necessary  to  dice,  bond,  or 
package  the  chips;  fabricate  the  printed  boards; 
solder  the  packages  onto  the  printed  boards;  or 
wire  the  printed  board  connectors  together.  The 
3-D  assembler  simply  places  the  wafers  one  on  top 
of  the  other.  The  majority  of  the  fabrication  time 
will  be  spent  on  testing,  and  even  that  will  be  a 
minor  effort  compared  to  that  encountered  with  more 
conventional  architectures. 


In  summary,  the  3-D  computer  offers  several 
important  features: 

o  Very  high  data  throughput  (>10^ 
instructions/sec) 

o  Very low power (<30 W)

o  Extremely small size (<6 in.³)

o  Potentially  low  cost. 

These  specifications  were  calculated  assuming  an 
array  of  128  x  128  cells  using  presently  available 
1:1  photolithography  techniques. 
Abstract

This  paper  describes  the  organization  and  operation 
of  a  semantic  network  array  processor  (SNAP).  The 
architecture  consists  of  an  array  of  identical  cells  each 
containing  a  content  addressable  memory,  microprogram 
control  and  communication  unit.  The  applications 
discussed  in  this  paper  are  discrete  relaxation  and 
dynamic  programming  for  stereo. 

1.  Introduction 

The  nature  of  symbolic  processing  used  in  artificial 
intelligence  (AI)  is  different  in  many  ways  from 
conventional  programming  language  processing. 
Consequently,  the  architecture  of  computers  intended  for 
AI  applications  should  be  different  from  today's 
commonly  used  von  Neumann  computers.  The  mapping 
of  AI  algorithms  into  architectures  cannot  be  done  with 
the  same  efficiency  as  that  of  numerical  signal  processing 
algorithms  (mapping  into  systolic  arrays,  for  example). 
Communication  networks  supporting  packet  switching 
and  complicated  data  transfer  protocols  are  necessary. 

Vision algorithms range from very low level
number crunching to symbolic processing. It is not
possible to efficiently implement all these algorithms on a
single machine. SNAP (Semantic Network Array
Processor), currently under study at USC, addresses the
high end of vision processing. In this paper we show
how SNAP can be effectively used for discrete relaxation
and dynamic programming for stereo. The interested
reader should see [8] for other symbolic processing
applications such as pattern search, inference, and
production systems.

In this paper we first briefly present the
architecture and the instruction set of SNAP, and then show
how discrete relaxation and dynamic programming can be
processed on SNAP.

The research was supported by DARPA contract No.
F- >, trii.VSJ-K-lTRB and F-33615-84-K-1401.

2.  SNAP  Architecture 

The  architecture  consists  of  a  square  array  of 
identical  processing  cells  which  are  interconnected  both 
globally  and  locally  as  shown  in  Figure  2-1. 


Figure 2-1: Architecture of SNAP


Its functionality rests upon two underlying concepts:
associative processing and cellular array processing. Each
cell contains memory, control logic, and communication
logic. As a whole, the array is operated by an outside
controller which also provides an interface between SNAP
and a host computer. Our intent was to minimize the
role of the global functions which affect the entire array
and to provide more operational freedom for each
individual cell. The cells can be microprogrammed so
they can operate independently. The signals involved in
the inter-cell communications are propagated from a cell
to another neighboring cell via local buses. As a result,
any cell can communicate through a chain of intermediate
cells with any other cell in the array. A cell's address is
specified by its row number followed by its column
number. Information in a particular cell can be retrieved
either by its content (as in associative memories) or by
that cell's address. The associative processing concept
alone is not sufficient for designing efficient architectural
structures for AI. This is because retrieval operations in
AI are more complex than matching simple words; most
frequently we need to match subgraphs or other patterns.
Also, we need to pursue several hypotheses in parallel, and
this is not possible with simple associative processors. This
architecture blends the power of associative memories for
performing fast information retrievals with the capability
of cellular arrays to process various tasks in parallel. The
regularity of the array is an important feature which
makes its VLSI implementation feasible.

2.1.  Architecture  of  the  Cell 

Each cell contains a content addressable memory
(CAM), a processing unit (PU), and a communication unit
(CU) as shown in Figure 2-2. The cell also has some
general purpose registers and flags. The processing unit has
simple microinstructions such as AND, OR, NOT,
RESET (a flag), MASK (a field of a string), and MATCH (a
string). The role of the CU is to perform data transfers
between cells. The data is transferred in packets via local
connections. The CU has an input queue which will hold
the packet until it can claim the attention of its neighbor.
The exact route taken by a packet to reach its destination
depends on the routing algorithm and the actual traffic.
The CU also provides access to the cell via the global bus.
The global bus is used mainly by the host computer for
instruction and data broadcast and data retrieval.
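
The text leaves the routing algorithm unspecified. Purely as an illustration of how a packet might travel through a chain of neighboring cells, the sketch below uses dimension-order routing on (row, column) addresses. That routing policy, the absence of any traffic model, and the function names are all assumptions of this sketch and are not taken from the SNAP design.

```python
def next_hop(current, dest):
    """Return the neighboring cell address one hop closer to dest.

    Addresses are (row, column) tuples. Rows are corrected first, then
    columns; this ordering is an assumed policy used only for illustration.
    """
    row, col = current
    dest_row, dest_col = dest
    if row != dest_row:
        return (row + (1 if dest_row > row else -1), col)
    if col != dest_col:
        return (row, col + (1 if dest_col > col else -1))
    return current  # already at the destination


def route(src, dest):
    """List every cell a packet visits from src to dest, inclusive."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path


print(route((0, 0), (2, 1)))
```

A traffic-dependent router, as the text implies, would choose among several candidate neighbors instead of a single fixed one.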


Figure  2-2:  Architecture  of  SNAP  Cell 


2.2.  Data  Representation 

A  semantic  network,  here,  is  thought  of  as  a  colored 
graph.  The  nodes  of  the  network  represent  objects  or 
concepts  and  the  arcs  represent  the  relations  between  the 
objects.  Usually,  one  cell  is  assigned  to  each  node  of  the 
semantic  network.  The  relations  associated  with  that 
node are stored in the CAM. This realizes a doubly linked
data structure allowing the operations to be carried out in
either direction. An entry in the CAM consists of several
fields. Complex matching and masking of individual
fields can be achieved.
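
A minimal sketch of field-wise matching with masked fields may make this concrete. It models a CAM entry as a tuple of fields and a query as a tuple in which a wildcard marker (written '%' here) masks a field. The tuple layout, the example relation names, and the function names are assumptions made for illustration, not the SNAP hardware format.

```python
DONT_CARE = "%"


def matches(entry, pattern):
    """True if every unmasked field of pattern equals the entry's field."""
    return len(entry) == len(pattern) and all(
        p == DONT_CARE or p == e for e, p in zip(entry, pattern)
    )


def tag_entries(cam, pattern):
    """Return the indices of the CAM entries selected by the pattern."""
    return [i for i, entry in enumerate(cam) if matches(entry, pattern)]


# Hypothetical CAM contents for one cell: (relation, target node).
cam = [("is-a", "bird"), ("has", "wings"), ("is-a", "animal")]
print(tag_entries(cam, ("is-a", DONT_CARE)))
```

In hardware all entries would be compared at once; the loop here only stands in for that parallel comparison.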

2.3. SNAP Instruction Set

The following instructions have been provided for
processing symbolic data on SNAP. Their general format
is <opCode> <arg1> <arg2> <arg3>. The first
field <opCode> is the name of the function. The next
two fields <arg1> and <arg2> can be flags, nodes,
relations, or don't cares. If the argument is a relation then
it can have subfields and any of the subfields can be a
don't care. We will use '%' to denote a don't care field.
The third argument is generally a flag.

1. Search. The first argument is a node and the
second one is a relation. The entries in the
CAM corresponding to the relation (<arg1>
<arg2> %) are tagged and the cells which
succeed in tagging a CAM entry will set their
flag indicated by the third argument. For
example, (search A R #1) will tag all CAM
entries matching R in all the cells holding
node A and set the flag 1 in those cells where
a CAM entry was tagged.

2.  And, Or.  These  are  operations  on  flag  or 
register  arguments.  They  will  perform  the 
logical  operations  on  the  first  two  arguments 
and  place  the  result  in  the  third. 

3.  Not  will  perform  the  logical  negation  of  the 
first  flag  or  register  argument  and  place  the 
result  in  the  second. 

4. And-Cam (Or-Cam). The cell whose flag
specified by <arg1> is set will AND (OR)
together all the tagged entries in the CAM
with the data specified by <arg2>, and put
the result in the register specified by <arg3>.

5. Compare. The cells compare the registers
specified by <arg1> and <arg2> for
equality and set the flag specified by <arg3>
if they are equal.

6. F-search. The cell whose flag given by
<arg1> is set will send a message to the cells
pointed to by the tagged relations in its CAM to
set the flag <arg2> in them. The instruction
F-prop is equivalent to F-search <arg1>
<arg2> followed by F-search <arg2>
<arg2> repeated until its failure. The
instruction F-prop fails if the CAM has no
tagged entries or the activity has cycled back
to the originating cell(s).

7. Collect, Collect-Relation. Collect will retrieve
the names of nodes from the cells whose flag
given by <arg1> is set and bind the list to
the variable <arg2>. The instruction
Collect-Relation works much the same as
Collect except that it retrieves the tagged
relations.

8. Create. This is the instruction for initialization
of the array. It adds the relation (<arg1>
<arg2> <arg3>) to the CAM of the cell
holding <arg1>. The node <arg1> or
<arg2> will be created if it does not already
exist.

9. Delete. The instruction will delete all the
tagged entries in the CAM of the cell whose
flag <arg1> is set.

10. Send-Message. The cell whose flag <arg1> is
set will send the message given by <arg2> to
the node specified by <arg3>. If no
destination is specified then the message is
sent to the cells pointed to by the tagged
relations in the CAM. Notice that this
instruction is more primitive than F-prop,
which uses it.

11. Messages-Exist is true if there are data
packets in the message-queue of the cell.

12. Reset-Tags will reset all tags in the CAM of
the cell whose flag <arg1> is set.
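
To show how a few of these instructions combine, the following toy model runs Create, Search, F-search, and Collect on a small network. It is a sequential sketch under stated assumptions: one cell per node, CAM entries stored as (relation, target node) pairs, the third field of a created relation treated as the target node, and Collect returning a list instead of binding a host variable. The class and method names are hypothetical and do not describe the actual SNAP microcode.

```python
class Cell:
    def __init__(self, name):
        self.name = name
        self.cam = []         # list of (relation, target_node) entries
        self.tagged = set()   # indices of tagged CAM entries
        self.flags = set()    # numbers of the flags currently set


class Array:
    def __init__(self):
        self.cells = {}

    def create(self, node, relation, target):
        """Create: add (node relation target), making cells as needed."""
        for n in (node, target):
            self.cells.setdefault(n, Cell(n))
        self.cells[node].cam.append((relation, target))

    def search(self, node, relation, flag):
        """Search: tag matching entries; set flag where an entry was tagged."""
        for cell in self.cells.values():
            if node not in ("%", cell.name):
                continue
            hits = {i for i, (r, _) in enumerate(cell.cam) if relation in ("%", r)}
            cell.tagged |= hits
            if hits:
                cell.flags.add(flag)

    def f_search(self, flag_in, flag_out):
        """F-search: flagged cells set flag_out in the cells they point to."""
        targets = [cell.cam[i][1] for cell in self.cells.values()
                   if flag_in in cell.flags for i in cell.tagged]
        for t in targets:
            self.cells[t].flags.add(flag_out)

    def collect(self, flag):
        """Collect: names of the nodes whose flag is set."""
        return sorted(c.name for c in self.cells.values() if flag in c.flags)


net = Array()
net.create("A", "R", "B")
net.create("A", "R", "C")
net.create("A", "S", "D")
net.search("A", "R", 1)   # tag A's R relations, set flag 1 in A
net.f_search(1, 2)        # set flag 2 in the cells those relations point to
print(net.collect(2))
```

Here the loops over cells stand in for operations that the array would perform in all cells simultaneously.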

3.  The  Labeling  Problem 

The  labeling  problem  is  a  problem  of  constraint 
satisfaction.  Given  a  set  of  objects,  a  set  of  labels  and  a 
set  of  constraints,  the  problem  is  to  assign  a  label  to  each 
object  without  violating  any  of  the  constraints.  A 
labeling  is  called  unambiguous  if  it  assigns  a  single  label 
to  each  object  and  it  is  consistent  if  it  does  not  violate 
any  constraints.  An  ambiguous  labeling  assigns  a  set  of 
labels  to  each  object.  An  ambiguous  labeling  is  consistent 
as  long  as  for  every  choice  of  a  label  for  an  object  there 
exists  at  least  one  assigned  label  for  every  other  object 
that  does  not  violate  any  constraints.  There  may  be 
none,  one,  or  many  unambiguous  and  consistent  possible 
labelings. Sometimes one is interested in any
unambiguous  and  consistent  labeling  and  at  other  times  in 
all  possible  solutions. 


Depth-first search can be conducted to obtain an
unambiguous and consistent labeling. However, with no
guidelines for choosing the initial labelings, the search can
be very inefficient. Sequential search can be sped up
by using the discrete relaxation technique [10]. In this
technique, a label assignment that leads to an
inconsistent labeling is removed, thus reducing the search
space.

In this section we discuss the parallel processing of
the labeling problem using the relaxation technique on
SNAP. First, we present a model for the discrete
relaxation operation and show how this model can be put
in the form of a semantic network acceptable to SNAP.
Then, we show how the discrete relaxation can
be processed on this architecture.

3.1.  Problem  Definition 

What follows is a parallel model of the Waltz filtering
process as described by Rosenfeld et al. [10]. Let
$A = \{a_1, a_2, \ldots, a_n\}$ be the set of objects to be labeled with the
labels in the set $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$. Let the unary
constraints on the labeling be such that $\Lambda_i \subseteq \Lambda$ is the set
of possible labels for object $a_i$. Then for any pair of
objects $(a_i, a_j)$, let $\Lambda_{ij} \subseteq \Lambda_i \times \Lambda_j$ denote the compatible labels
in accordance with the binary constraints, i.e., $(\lambda, \lambda') \in \Lambda_{ij}$
means that it is possible to label the object $a_i$ with $\lambda$ and
$a_j$ with $\lambda'$ without violating any constraints. General
treatment of n-ary constraints can be found in [5]. In this
paper we shall restrict ourselves to unary and binary
constraints. A labeling $L = (L_1, \ldots, L_n)$ of $A$ is an
assignment of a set of labels $L_i \subseteq \Lambda$ to each $a_i \in A$. The
labeling is consistent if we have

$$\forall i, j \;\; \forall \lambda \in L_i : \;\; (\{\lambda\} \times L_j) \cap \Lambda_{ij} \neq \emptyset \qquad (1)$$

A labeling $L = (L_1, \ldots, L_n)$ is said to be included in
another labeling $L' = (L'_1, \ldots, L'_n)$ if $L_i \subseteq L'_i$ for all $i$. A
labeling is the greatest if it is consistent and not included
in any other consistent labeling. A labeling is ambiguous
if it assigns more than one label to any object. It can be
shown [10] that there always exist consistent labelings, in
particular, the null and the greatest labelings.

The algorithm starts with an initial labeling
$L^0 = (\Lambda_1, \ldots, \Lambda_n)$. Let $L^k$ be the labeling at the $k$th iteration of the
algorithm. Then the labeling $L^{k+1}$ is obtained in the next
iteration by discarding any label $\lambda$ from each $L_i^k$ such that
$(\{\lambda\} \times L_j^k) \cap \Lambda_{ij} = \emptyset$ for some $j$, i.e., the label $\lambda$ for object $a_i$
violates the consistency condition (1). Due to finiteness of
the problem domain the algorithm converges. It can be
shown [10] that the algorithm converges to the greatest
consistent labeling.
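
The iteration just described can be sketched sequentially as follows. The sketch takes the unary constraints as a set of allowed labels per object and the binary constraints as sets of compatible label pairs. It assumes that object pairs with no entry are unconstrained and it skips the case of an object paired with itself; both are simplifications of this sketch. The function and variable names are hypothetical.

```python
def relax(unary, compatible):
    """Discrete relaxation over unary and binary constraints.

    unary:      dict object -> set of initially allowed labels
    compatible: dict (i, j) -> set of allowed (label_i, label_j) pairs;
                a missing (i, j) key means the pair is unconstrained
    Returns the labeling reached when no label can be discarded.
    """
    labeling = {obj: set(labels) for obj, labels in unary.items()}
    changed = True
    while changed:
        changed = False
        # Snapshot, so every discard in this pass is judged against the
        # labeling of the previous iteration (the parallel model).
        current = {obj: set(labels) for obj, labels in labeling.items()}
        for i, labels_i in current.items():
            for lab in labels_i:
                for j, labels_j in current.items():
                    if i == j or (i, j) not in compatible:
                        continue
                    if not any((lab, other) in compatible[(i, j)] for other in labels_j):
                        labeling[i].discard(lab)
                        changed = True
                        break
    return labeling


# Toy problem: neighbors must take different labels, and c allows only 1.
unary = {"a": {1, 2}, "b": {1, 2}, "c": {1}}
compatible = {
    ("a", "b"): {(1, 2), (2, 1)},
    ("b", "a"): {(1, 2), (2, 1)},
    ("b", "c"): {(2, 1)},
    ("c", "b"): {(1, 2)},
}
print(relax(unary, compatible))
```

In this toy case the first pass removes label 1 from b, the second removes label 2 from a, and the third pass changes nothing, so the loop stops.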

The labeling problem has many applications.
Haralick and Shapiro [6] have shown that the problems of
subgraph isomorphism [12], scene labeling [1], shape
matching [2] and many others are specific instances of the
labeling problem. Rutkowski et al. [11] have shown the
applicability of the relaxation technique to scene
segmentation.

3.2.  Semantic  Network  Representation  of  the 
Labeling  Problem 

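The labeling problem can be equivalently described
by a graph model. Construct a colored directed graph
G=(V,L,E) where

1. V is the set of vertices such that $v_i \in V \Leftrightarrow a_i \in A$
for all $i$,

2. L is the set of colors, $L = \Lambda \times \Lambda$, and

3. E is the set of edges, $E = \{ (v_i, v_j, l) \mid v_i, v_j \in V \text{ and } l \in \Lambda_{ij} \}$.

This graph represents a labeling $L = (L_1, \ldots, L_n)$ where
$L_i = \{ \lambda \mid (v_i, v_j, (\lambda, \lambda')) \in E \text{ for some } j \text{ and some } \lambda' \in \Lambda \}$. It is
easily seen that the graph thus constructed represents a
consistent labeling iff

$$\forall v, v' \;\; \forall \lambda, \lambda' \;\; (v, v', (\lambda, \lambda')) \in E \;\Rightarrow\; \forall v'' \neq v \;\; \exists \lambda'' \;\; (v, v'', (\lambda, \lambda'')) \in E \qquad (2)$$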

3.3.  Discrete  Relaxation  on  SNAP 

In section 3.2 we showed how the discrete relaxation
problem can be represented by a semantic network.
Allocate one cell to each vertex in the semantic network.
For this problem it suffices to store only the relations
emanating from each node. The relation in our case is
the tuple $(v, (\lambda, \lambda'), v')$ where $(\lambda, \lambda')$ is the label on the arc from
node $v$ to $v'$. To simplify checking the consistency
condition we represent the vertices by bit vectors. The
labels are encoded. The format of the CAM entries is
<λ><λ'><v'>.

In  each  iteration  of  the  algorithm,  the  host 
computer  scans  through  all  the  labels.  Each  cell  tags  all 
the  relations  that  assign  the  label  selected  by  the  host  to 
the  object  represented  by  the  cell.  If  this  assignment  is 
not  consistent  then  the  tagged  relations  are  deleted.  The 
host  then  traces  along  those  relations  enabling  the  cells 
where  the  corresponding  backward  relations  are  stored. 
The  backward  relations  are  tagged  and  deleted.  With  the 
bit  vector  representation,  the  consistency  check  can  be 
accomplished  by  logical  OR  and  compare  instructions. 
The  program  flow  for  a  single  iteration  on  SNAP  is  shown 


below.  The  program  will  be  terminated  if  the  host  finds 
that  none  of  the  #2  flags  are  set  at  the  end  of  an 
iteration. 

{ Notes:

1. The instruction reset-flag is not a SNAP
   primitive. It can be written using AND
   and other logical operations.

2. Formation of the message in the send-message
   instruction can be done using other functions.

3. V and I are registers. I holds the id of the
   node. }

FOR ALL λ ∈ Λ
BEGIN

    RESET-TAG %             { reset all flags and tags }
    RESET-FLAG %

    SEARCH % (λ % %) #1     { tag all relations
                              beginning with λ }

    OR-CAM #1 V             { OR of all tagged
                              relations }

    AND V 'vertex V         { get vertex field of
                              the ORed result }

    COMPARE V '111..1 #2    { if the label
                              points to all
                              objects then set
                              flag 2, otherwise
                              the label is
                              inconsistent }

    RESET-TAG #2            { if consistent then do
                              nothing to this cell }

    NOT #2 #2

    SEND-MESSAGE #2 (search % (% λ I) #2)
                            { send the message to tag
                              backward inconsistent
                              relations in the cells
                              pointed to by inconsistent
                              relations }

    DELETE #2               { delete inconsistent
                              entries }

END.

Clearly, the running time for each iteration is
$O(|\Lambda|)$. Due to possible bus contention during the
send-message instruction, the total running time is much
dependent  on  the  structure  of  the  semantic  network  and 
the  node  to  cell  allocation  strategy.  A  good  allocation 
policy  is  to  preserve  the  locality  of  the  nodes  in  the 
semantic  network  as  much  as  possible.  The  details  of 
applications  of  the  discrete  relaxation  and  the  symbolic 
execution  of  the  algorithm  on  SNAP  can  be  found  in  the 
paper  by  Dixit  and  Moldovan  [3]. 
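
The OR-and-compare consistency check used in the program can be sketched with integers as bit vectors. The sketch assumes a one-hot encoding in which each neighboring object owns one bit, so that the OR of the vertex fields of all tagged relations equals an all-ones mask exactly when the selected label is supported by every neighbor. The encoding and the function name are assumptions of this sketch; the text states only that vertices are represented by bit vectors.

```python
def label_is_consistent(tagged_vertex_fields, required_mask):
    """OR the vertex fields of the tagged relations and compare to the mask."""
    combined = 0
    for field in tagged_vertex_fields:
        combined |= field
    return combined == required_mask


# Three neighboring objects encoded one-hot as 0b001, 0b010 and 0b100.
all_neighbors = 0b111
print(label_is_consistent([0b001, 0b010, 0b100, 0b010], all_neighbors))
print(label_is_consistent([0b001, 0b100], all_neighbors))
```

The second call leaves the middle bit clear, meaning no compatible label remains at that neighbor, so the selected label would be deleted.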


4.  Stereo  and  Dynamic  Programming 

This section illustrates how SNAP can be effectively
used to solve problems requiring dynamic programming.
The problem of stereo matching is used as a vehicle to do
so. The problem definition and its solution are adopted
from Ohta and Kanade [9]. We will not describe it
here in great detail. The interested reader should consult
the original report.

Stereo,  whether  edge-based  or  feature-based,  has  the 
basic  problem  of  correspondence  of  the  objects  in  one 
image  to  those  in  the  other.  Ohta  and  Kanade  consider 
the  problem  of  correspondence  in  edge-based  stereo. 
They  suggest  a  solution  consisting  of  dynamic 
programming  at  two  levels.  On  the  inner  level,  the 
problem  is  solved  on  each  scanline;  on  the  outer  level,  the 
optimization  is  accomplished  for  the  whole  image.  The 
goal  is  to  find  a  path  in  3D  search  space,  which  is  a  stack 
of  2D  search  planes.  The  inner  level  dynamic 
programming,  called  intra-scanline  search,  is  to  find 
optimal  paths  in  2D  planes.  A  scanline  from  left  image 
and  another  from  right  image  form  the  axes  of  a  2D 
search plane. The outer level dynamic programming,
called inter-scanline search, is to find an overall optimal
matching surface (set of paths) under the constraint given
by connected edges. The dynamic programming searches
in both levels are conducted simultaneously.

4.1.  Intrascanline  search  on  2D  plane 

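Let $\mathbf{m} = (m, n)$ represent a node in the 2D search
plane, where $m$ and $n$ are the indices of the left and right
image edges on the scanline. Let $D(\mathbf{m}, \mathbf{k})$ be the cost of the
minimal cost partial path from node $\mathbf{k}$ to node $\mathbf{m}$. Let $d(\mathbf{m}, \mathbf{k})$ be
the cost of the primitive path from node $\mathbf{k}$ to node $\mathbf{m}$.
The cost of a primitive path is a function of intensity,
color, orientation, and other directly measurable
quantities of the edges involved. The cost of the minimal cost path
from the origin, $D(\mathbf{m}) = D(\mathbf{m}, \mathbf{0})$, can be defined recursively as:

$$D(\mathbf{m}) = \min_{\{\mathbf{i}\}} \{ d(\mathbf{m}, \mathbf{m} - \mathbf{i}) + D(\mathbf{m} - \mathbf{i}) \} \qquad (3)$$

$$D(\mathbf{0}) = 0$$

Here $\mathbf{m} = (m, n)$, $\mathbf{i} = (i, j)$, $0 \le i \le m$,
$0 \le j \le n$, $i + j \ne 0$.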
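
Recurrence (3) can be evaluated directly with memoization, as in the sketch below. The primitive-path cost is passed in as a function because the text defines it only as depending on measurable edge properties. The toy cost used in the example is invented for illustration, and the additional constraints of the original stereo formulation are not modeled.

```python
from functools import lru_cache


def min_path_cost(m, n, d):
    """Cost of the cheapest path from (0, 0) to (m, n) under recurrence (3).

    d(node, prev) is a caller-supplied primitive-path cost function.
    """
    @lru_cache(maxsize=None)
    def D(mi, ni):
        if (mi, ni) == (0, 0):
            return 0
        return min(
            d((mi, ni), (mi - i, ni - j)) + D(mi - i, ni - j)
            for i in range(mi + 1)
            for j in range(ni + 1)
            if i + j != 0
        )

    return D(m, n)


# Hypothetical cost: free for a one-edge step in both images, growing
# quadratically as a step deviates from that in either image.
def toy_cost(node, prev):
    return (node[0] - prev[0] - 1) ** 2 + (node[1] - prev[1] - 1) ** 2


print(min_path_cost(3, 3, toy_cost))
```

With this toy cost, the diagonal path that matches one left edge to one right edge at every step is free, so the minimum for (3, 3) is zero.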

4.2.  Inter-scanline  search  in  3D  space 

The 3D space is formed by the stack of the 2D
planes for intra-scanline search. Since a 3D node is a set
of 2D nodes, the cost of a 3D node is the sum of the costs
of its 2D nodes obtained by intra-scanline search. To use
dynamic programming, the 3D nodes must be ordered
into decision stages. The ordering rule of Ohta and
Kanade is: When we examine the correspondence of the
two connected edges, one in the right and one in the left
image, the connected edges which are on the left of these
connected edges in each image must already be processed.

The connected edges are numbered from 0 to U [V] in the
left [right] image. Let $\mathbf{u}$ represent the 3D node $(u, v)$. The
cost of optimal paths, $C(\mathbf{u})$, which come up to the 3D
node $\mathbf{u}$, is computed as follows:

$$C(\mathbf{u}) = \min_{\{\mathbf{i}(t)\}} \sum_{t=s(\mathbf{u})}^{e(\mathbf{u})} \{ D(I(\mathbf{u};t), I(\mathbf{u}-\mathbf{i}(t);t); t) + C(\mathbf{u}-\mathbf{i}(t); t) \} \qquad (4)$$

$$C(\mathbf{0}) = 0, \text{ i.e., } C(\mathbf{0}; t) = 0 \text{ for all } t$$

Here $\mathbf{u} = (u, v)$, $\mathbf{i}(t) = (i(t), j(t))$, $0 \le i(t) \le u$, $0 \le j(t) \le v$,
$i(t) + j(t) \ne 0$.

Here $C(\mathbf{u}; t)$ is the cost of the path on the scanline $t$
in the optimal set and $C(\mathbf{u})$ is the sum of $C(\mathbf{u}; t)$, with $t$ going
from the starting scanline $s(\mathbf{u})$ to the ending scanline $e(\mathbf{u})$.
$D(\mathbf{m}, \mathbf{k}; t)$ is the cost of the optimal 3D primitive path
from node $\mathbf{k}$ to node $\mathbf{m}$ on the 2D plane for scanline $t$.
The function $I(\mathbf{u}; t)$ gives the index of a node belonging to
the 3D node $\mathbf{u}$ on the 2D plane for scanline $t$.

4.3.  Stereo  on  SNAP 

Moldovan [7] describes how dynamic programming
can be mapped onto a systolic array starting from a
FORTRAN program. The same ideas can be used here to
identify the data movements when the stereo problem is
mapped onto SNAP. Although SNAP is more complex
than a systolic array, it can be used in systolic mode by
simply restricting one's programming style. The assumption
here is that the host and the array controller are capable
of issuing commands and controlling the external data
movements in synchrony with the activity in the SNAP
array. The solution we suggest here is not purely systolic,
but a combination of pipelining, broadcasting, and
associative memory operations.

Let us rewrite equation (4) for computing the
minimum cost path as a FORTRAN program. In the
following program two new variables A and B are
introduced to expand the minimization operations into
FORTRAN loops.

for u = 0 ... U
  for i = 0 ... u
    for t = s(u) ... e(u)
      for m = m0 ... m1
        for p = 0 ... m1-m0
          D(m,m0,t) = min{D(m-p,m0,t)
                          + d(m,m-p), D(m,m0,t)}
        end
      end
      C(u,t) = C(u-i(t),t) + D(m1,m0,t)
      B(u,t) = B(u,t-1) + C(u,t)
    end
    A(u) = min{A(u), B(u,e)}
  end
end.

Note:

1. m0 = I(u-i(t);t)
   m1 = I(u;t)

2. Use of a vector as the index of a for loop simply
   denotes nested for loops, one loop for each component
   of the vector. In our case, the order of nesting
   is immaterial.

4.3.1.  Data  Representation  and  Cell  Allocation 

One  row  of  SNAP  is  allocated  to  each  scanline. 
Each  cell  in  a  row  stores  information  about  one  edge  in 
left(right)  image,  and  the  associative  memory  within  each 
cell  contains  an  entry  for  each  possible  matching  edge  in 
the  right(left)  image.  An  entry  in  the  associative  memory 
has  the  fields  shown  in  Figure  4-1. 


(Note: A, B, C, and D are variables in the program)
Figure 4-1: CAM Entry for a Possible Matching Edge

The cells are assigned to the edges in the right or left
image such that the vertically (horizontally) connected
edges are mapped onto the same column (row) and
consecutive rows (columns). This can be thought of as an
iconic representation of the image at the edge level. If
horizontally connected edges are present, broadcasting
may be necessary to compute the summation over the
t-loop; otherwise, simple pipelining would suffice.

4.3.2.  Program  Flow 

The  program  on  the  host-SNAP  combination  is  a 
parallelized  and  pipelined  version  of  the  sequential 
program.  The  different  for  loops  are  executed  as  follows: 

u-loop : in parallel for all connected edges

i-loop : sequentially run by host

t-loop : summation is done sequentially by host

m-loop : in parallel for all edges on every scanline

p-loop : pipelined along rows

The host executes the i-loop sequentially, selecting a
point $\mathbf{i} = (i, j)$. Each cell in SNAP computes the
limits for the m-loop. The pipelined execution of the p-loop
then begins. The leftmost cell in each scanline starts sending
the values of D(*) of each CAM entry along with other
information to the cell to its right. The other cells receive
the values from the left cell and update their values of
D(*). The D(*) values are updated for all entries in the
CAM simultaneously by bitwise arithmetic operations.
These operations are addition, multiplication, and taking
the minimum. Algorithms for such CAM computations
are described by Foster [4]. After updating the D(*)
values, the packet received from the left is pipelined to the
right cell. When the cell originating the data packets for
the pipeline sends its last packet, which is marked as
such, the next cell becomes the originator. The p-loop
ends when the rightmost cell becomes the originator.
Now the summation over scanlines, the t-loop, is done
serially by the host employing broadcasting and
pipelining. A minimum is taken to compute the value of
A and the process is iterated for another value of $\mathbf{i}$.

In  order  to  retrieve  the  optimal  path,  one  must  do  a 
backward  search:  start  from  the  final  point  (cell  and 
CAM  entry)  in  the  path.  Fetch  the  id  of  the  previous 
point  which  produced  this  value.  Go  to  that  previous  cell 
and  so  on  until  you  reach  the  origin  of  the  path. 
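
The backward search amounts to following stored predecessor ids, as in this short sketch. It assumes each point's predecessor has been recorded in a dictionary while the costs were computed, and that the origin is the only point without a recorded predecessor. The data layout and names are hypothetical.

```python
def backtrack(final_point, previous):
    """Follow predecessor ids from the final point back to the origin.

    previous maps each point to the point that produced its value; the
    origin is taken to be the point with no entry in the map.
    """
    path = [final_point]
    while path[-1] in previous:
        path.append(previous[path[-1]])
    path.reverse()
    return path


# Hypothetical predecessor records left behind by the forward pass.
previous = {(2, 2): (1, 1), (1, 1): (0, 0)}
print(backtrack((2, 2), previous))
```

On SNAP each step of this loop would correspond to fetching the stored id from a CAM entry and moving to the cell it names.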

4.3.3.  Complexity 

The maximum number of the total CAM entries in
any row is not more than MN, the product of the
maximum number of edges on any scanline in the two
images. Similarly the time for the i-loop is O(UV), where U
and V are the number of connected edges in the left and
the right image respectively. The summation over
scanlines takes time proportional to the maximum of the
lengths of connected edges, which is O(Tb), where T is the
number of scanlines and b is the number of bits used for
numbers. Thus the total time on SNAP is O(UVMNTb).
Compare  this  with  the  sequential  execution  time 
0(b¥A/¥7). 

5.  Conclusions 

In this paper we presented an architecture intended
for symbolic processing. We examined the applicability of
SNAP to the image understanding problems of scene labeling
and stereo. The two problems are computationally
intensive and employ different techniques for their
solutions. The solution to stereo uses intra- and inter-scanline
dynamic programming, where the computations
are highly ordered. However, the discrete relaxation used
for the labeling problem is a parallel technique. Since
these are general techniques, any other problems that use
them for their solutions could be solved on SNAP with
the same speed up.
EXTENDED GAUSSIAN IMAGES


Berthold  K.  P.  Horn 
Artificial  Intelligence  Laboratory 
Massachusetts Institute of Technology


Abstract 

This is a primer on extended Gaussian images. Extended
Gaussian  Images  are  useful  for  representing  the  shapes  of 
surfaces.  They  can  be  computed  easily  from: 

1.  Needle  maps  obtained  using  photometric  stereo,  or 

2.  Depth  maps  generated  by  ranging  devices  or  binocular 
stereo. 

Importantly,  they  can  also  be  determined  simply  from 
geometric  models  of  the  objects.  Extended  Gaussian  images 
can  be  of  use  in  at  least  two  of  the  tasks  facing  a  machine 
vision  system: 

1.  Recognition,  and 

2. Determining the attitude in space of an object.

Here,  the  extended  Gaussian  image  is  defined  and  some 
of  its  properties  discussed.  An  elaboration  for  non-convex 
objects  is  presented  and  several  examples  are  shown. 


1.  Introduction 

In  order  to  recognize  an  object  and  to  determine 
its  attitude  in  space,  it  is  necessary  to  have  a  way  of 
representing  the  shape  of  its  surface.  Giving  the  distance 
to  the  surface  along  parallel  rays  on  a  regularly  spaced  grid 
provides  one  way  of  doing  this.  This  simple  representation 
is  called  a  depth  map.  A  range  finder  produces  surface 
descriptions in this form, as does automated binocular
stereo  [1].  Unfortunately,  depth  maps  do  not  transform 
in  a  simple  way  when  the  object  rotates  (For  one  thing, 
interpolation  must  be  used  to  get  a  new  depth  map  on  a 
regularly  spaced  grid). 

Alternatively,  surface  orientation  might  be  given  for 
points on the surface on some regular sampling grid. This
grid may conveniently correspond to the picture cells in
an image. Such a simple representation is called a needle
map (Figure 1) [2]. Photometric stereo is a method for
recovering surface orientation using multiple images taken
with different lighting conditions [3-8]. It produces surface
descriptions in this form. A needle map also is not directly
helpful when it comes to comparing surfaces of objects
that may be rotated relative to one another (both depth
maps and needle maps depend on the position of the object
as well as its attitude).

Figure 1. A needle map shows unit surface normals
at points on the surface on a regular grid. Normals which
point towards the viewer will be seen as dots, while tilted
surface patches give rise to normals which are shown as
lines pointing in the direction of steepest descent.


The  extended  Gaussian  image,  on  the  other  hand, 
does make it easy to deal with the varying attitude of an
object in space as we shall see [2, 9-13]. For one thing, it is
insensitive to the position of the object. Some information
appears to be discarded in the formation of the extended
Gaussian image. Curiously, in the case of convex objects,
the  representation  is  nevertheless  unique.  That  is,  no  two 
convex  objects  have  the  same  extended  Gaussian  image. 

This  representation  of  the  shape  of  the  surface  of 
an  object  allows  one  to  match  information  obtained 
from  image  or  range  sensors  with  that  contained  in 
computer models of the objects and has proven most
useful in work on automatic bin picking [14-16]. A recent
report  describes  a  system  that  picks  one  object  out  of  a 
jumbled  pile  of  similar  objects  using  this  approach  [17]. 
We  start  our  discussion  with  objects  having  planar  faces. 
Later  we  consider  smoothly  curved  objects.  Methods  for 
computing  discrete  approximations  of  extended  Gaussian 
images,  called  orientation  histograms,  are  presented  too. 
Orientation  histograms  can  be  computed  from  experimental 
data  or  mathematical  descriptions  of  the  objects.  Sections 
marked  with  an  asterisk  may  be  omitted  on  first  reading 
or  if  your  interest  in  the  mathematical  details  is  limited. 

2. Discrete Case: Convex Polyhedra

Minkowski  showed  in  1897  that  a  convex  polyhedron 
is  fully  specified  (up  to  translation)  by  the  area  and 
orientation of its faces [18-20]. We can represent area and
orientation  of  the  faces  conveniently  by  point  masses  on 
a  sphere.  Imagine  moving  the  unit  surface  normal  of  each 
face  so  that  its  tail  is  at  the  center  of  a  unit  sphere. 
The  head  of  the  unit  normal  then  lies  on  the  surface 
of  the  unit  sphere.  This  sphere  is  called  the  Gaussian 
sphere  and  each  point  on  it  corresponds  to  a  particular 
surface  orientation.  The  extended  Gaussian  image  of  the 
polyhedron  is  obtained  by  placing  a  mass  at  each  point 
equal  to  the  surface  area  of  the  corresponding  face  (Figure 
2). 


Figure 2. The extended Gaussian image of a polyhedron
can be thought of as a collection of point masses on the
Gaussian sphere. Each mass is proportional to the area
of the corresponding face. Point masses on the visible
hemisphere are shown as solid marks, the others as open
marks. The center of mass (marked in the figure) must
be at the center of the sphere if the polyhedron is closed.
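
The construction just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function name `extended_gaussian_image`, the vertex/face list layout, and the use of the vertex centroid to orient normals outward (valid for convex solids only) are all choices made here for the example.

```python
import math


def sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]


def extended_gaussian_image(vertices, faces):
    """Return one (unit outward normal, area) pair per face.

    faces lists vertex indices in cyclic order around each planar face.
    """
    n = len(vertices)
    centroid = tuple(sum(v[k] for v in vertices) / n for k in range(3))
    egi = []
    for face in faces:
        pts = [vertices[i] for i in face]
        # Area vector of a planar convex polygon by fan triangulation.
        area_vec = (0.0, 0.0, 0.0)
        for j in range(1, len(pts) - 1):
            c = cross(sub(pts[j], pts[0]), sub(pts[j + 1], pts[0]))
            area_vec = tuple(a + 0.5 * ck for a, ck in zip(area_vec, c))
        area = math.sqrt(dot(area_vec, area_vec))
        normal = tuple(a / area for a in area_vec)
        # For a convex solid the outward normal points away from the centroid.
        if dot(normal, sub(pts[0], centroid)) < 0:
            normal = tuple(-x for x in normal)
        egi.append((normal, area))
    return egi


# Unit cube: vertex index is 4x + 2y + z.
cube_vertices = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
cube_faces = [(0, 1, 3, 2), (4, 5, 7, 6), (0, 1, 5, 4),
              (2, 3, 7, 6), (0, 2, 6, 4), (1, 3, 7, 5)]
egi = extended_gaussian_image(cube_vertices, cube_faces)
total_mass = sum(area for _, area in egi)
first_moment = tuple(sum(area * nrm[k] for nrm, area in egi) for k in range(3))
```

For the cube, `total_mass` is the total surface area and `first_moment` (the sum of the area-weighted normals) should vanish, in line with the statement in the caption about the center of mass of a closed polyhedron.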

It seems at first as if some information is lost in
this  mapping,  since  the  position  of  the  surface  normals  is 
discarded.  Viewed  another  way,  no  note  is  made  of  the 
shape  of  the  faces  or  their  adjacency  relationships.  It  can 
nevertheless  be  shown  that  (up  to  translation)  the  extended 
Gaussian  image  uniquely  defines  a  convex  polyhedron  [9]. 
An  iterative  algorithm  has  recently  been  invented  for 
recovering  a  convex  polyhedron  from  its  extended  Gaussian 
image  [21]. 


2.1.  Properties  of  the  Extended  Gaussian  Image 

The  extended  Gaussian  image  is  not  affected  by 
translation  of  the  object.  Rotation  of  the  object  induces 
an  equal  rotation  of  the  extended  Gaussian  image,  since 
the  unit  surface  normals  rotate  with  the  object. 

Mass  distributions  which  lie  entirely  within  one 
hemisphere,  that  is,  are  zero  in  the  complementary 
hemisphere,  do  not  correspond  to  closed  objects.  As  we 
shall  see,  the  center  of  mass  of  an  extended  Gaussian  image 
has  to  lie  at  the  origin.  This  is  clearly  not  possible  if  a  whole 
hemisphere  is  empty.  Also,  a  mass  distribution  which  is 
non-zero only on a great circle of the sphere corresponds to
the limit of a sequence of cylindrical objects of increasing
length and decreasing diameter (Figure 3). We will exclude
such pathological cases and confine our attention to closed,
bounded  objects  [9,  20]. 


Figure  3.  A  mass  distribution  confined  to  a  great 
circle  corresponds  to  the  limit  of  a  sequence  of  cylindrical 
objects  of  increasing  length  and  decreasing  diameter.  Such 
pathological mass distributions can be avoided if we confine
our  attention  to  bounded  objects. 

Some  properties  of  the  extended  Gaussian  image  are 
important:  Firstly,  the  total  mass  of  the  extended  Gaussian 
image  is  obviously  just  equal  to  the  total  surface  area  of 
the  polyhedron.  If  the  polyhedron  is  closed,  it  will  have 
the  same  projected  area  when  viewed  from  any  pair  of 
opposite  directions.  This  allows  us  to  compute  the  location 
of  the  center  of  mass  of  the  extended  Gaussian  image. 


Figure  4.  A  surface  element  appears  smaller  because 
of  foreshortening.  The  apparent  area  is  the  true  area  times 
the cosine of the angle between the surface normal and the
vector  pointing  towards  the  viewer. 

Imagine  viewing  a  convex  polyhedron  from  a  great 
distance.  Let  the  direction  from  the  object  towards  the 
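viewer be given by the unit vector $\mathbf{v}$. A face with unit
normal $\mathbf{s}_i$ will be visible only if $\mathbf{s}_i \cdot \mathbf{v} > 0$. Suppose that
the surface area of this face is $O_i$. Due to foreshortening
it will appear only as large as would a face of area

$$(\mathbf{s}_i \cdot \mathbf{v})\, O_i,$$

normal to $\mathbf{v}$ (Figure 4). The total apparent area of the
visible surface is

$$A(\mathbf{v}) = \sum_{\{i \,\mid\, \mathbf{s}_i \cdot \mathbf{v} > 0\}} (\mathbf{s}_i \cdot \mathbf{v})\, O_i,$$

when viewed from the direction $\mathbf{v}$. The total apparent
area of the visible surface when viewed from the opposite
direction is

$$A(-\mathbf{v}) = \sum_{\{i \,\mid\, \mathbf{s}_i \cdot \mathbf{v} < 0\}} \bigl(\mathbf{s}_i \cdot (-\mathbf{v})\bigr)\, O_i.$$

This should be the same, that is, $A(\mathbf{v}) = A(-\mathbf{v})$.
Consequently,

$$\sum_{\text{all } i} (\mathbf{s}_i \cdot \mathbf{v})\, O_i = \Bigl(\sum_{\text{all } i} \mathbf{s}_i\, O_i\Bigr) \cdot \mathbf{v} = 0,$$

where the sum now is over all faces of the object. This
holds true for all view vectors, $\mathbf{v}$, so we must have

$$\sum_{\text{all } i} \mathbf{s}_i\, O_i = 0.$$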

That  is,  the  center  of  mass  of  the  extended  Gaussian  image 
is  at  the  origin. 

An  equivalent  representation,  called  a  spike  model,  is 
a  collection  of  vectors  each  of  which  is  parallel  to  one  of 
the  surface  normals  and  of  length  equal  to  the  area  of 
the  corresponding  face.  The  result  regarding  the  center 
of  mass  is  equivalent  to  the  statement  that  these  vectors 
must  form  a  closed  chain  when  placed  end  to  end  (Figure 
5). 


Figure 5. Vectors parallel to the normals of the faces
of a polyhedron, and of length equal to the areas of the
corresponding faces, form a closed chain when placed end
to end.
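
The closure of the spike model, and the equality of the apparent areas seen from opposite directions, can be checked numerically. The sketch below is illustrative only: the tetrahedron vertices are arbitrary, and the helper names `spike_model` and `apparent_area` are hypothetical. Each spike is the outward unit normal of a face scaled by the area of that face.

```python
import math


def sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]


def spike_model(p0, p1, p2, p3):
    """Area-weighted outward normals (spikes) of a tetrahedron."""
    verts = [p0, p1, p2, p3]
    spikes = []
    for k in range(4):
        q0, q1, q2 = [verts[i] for i in range(4) if i != k]
        # Half the cross product of two edges has length equal to the face area.
        s = tuple(0.5 * c for c in cross(sub(q1, q0), sub(q2, q0)))
        # Flip the spike if it points towards the opposite vertex.
        if dot(s, sub(verts[k], q0)) > 0:
            s = tuple(-c for c in s)
        spikes.append(s)
    return spikes


def apparent_area(spikes, v):
    """Sum of (s_i . v) O_i over the faces visible from unit direction v."""
    return sum(max(0.0, dot(s, v)) for s in spikes)


spikes = spike_model((0, 0, 0), (2, 0, 0), (0, 3, 0), (0.5, 0.5, 4))
closure = tuple(sum(s[k] for s in spikes) for k in range(3))
v = (1 / math.sqrt(3),) * 3
front = apparent_area(spikes, v)
back = apparent_area(spikes, tuple(-c for c in v))
```

Here `closure` is the vector sum of the spikes, which should be zero up to rounding, and `front` and `back` should agree.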

2.2.  Reconstruction  of  a  Tetrahedron  (*) 

Faces that share a common edge are said to be adjacent.
The masses on the Gaussian sphere corresponding to two
adjacent faces need not be each other's closest neighbors.
Recovering a polyhedron from its extended Gaussian image
is not easy in the general case because it is hard to determine
which faces are adjacent [21]. Finding the actual offsets of
each  of  the  faces  from  the  center  of  mass  of  the  polyhedron 
is  not  as  hard. 

The  structure  of  a  tetrahedron,  however,  is  very 
simple:  Every  face  is  adjacent  to  the  other  three.  The 
shape  of  the  tetrahedron  is  completely  determined  by  the 
surface  normals  of  the  four  faces,  only  the  size  of  the 
tetrahedron  remaining  to  be  determined.  In  other  words: 
There  is  only  one  degree  of  freedom  left.  Another  way  to 
look  at  it  is  to  note  that  the  four  faces  must  have  areas 
that  place  the  center  of  mass  of  the  extended  Gaussian 
image  at  the  origin,  as  we  have  just  seen.  This  condition 
places  three  constraints  on  the  four  parameters. 


Figure  6.  A  tetrahedron  with  vertices  A,  B,  C,  and 
D.  We  are  to  find  the  distances  of  the  faces  from  the  center 
of  mass,  given  the  areas  and  surface  normals  of  the  faces. 

Let the given unit surface normals be $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$, and
$\mathbf{d}$, and the areas of the corresponding faces $A$, $B$, $C$,
and $D$ (Figure 6). We have to determine the distances,
$a$, $b$, $c$, and $d$, of these faces from the center of mass of
the tetrahedron. From these distances we can, if desired,
compute the positions of the vertices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$,
simply by intersecting three of the planes at a time. The
notation here is that the face opposite vertex $\mathbf{A}$ has area
$A$ and unit surface normal $\mathbf{a}$, and so on.

The  perpendicular  distance  of  the  center  of  area  of  a 
triangle  from  one  of  the  sides  is  one  third  the  perpendicular 
distance  of  the  vertex  opposite  that  side.  Similarly,  in  a 
tetrahedron,  the  distance  from  the  center  of  mass  to  a 
particular  face  is  equal  to  one  quarter  of  the  distance  of  the 
vertex opposite that face. We start by finding a formula for
the distance of the face with area $D$, say, from the opposite
vertex $\mathbf{D}$. The desired distance, $d$, will be just a quarter
of the result obtained in this fashion. The remaining three
distances, $a$, $b$, and $c$, can then be computed using formulae
obtained by cyclical permutation of the variables.

The  position  of  the  reconstructed  tetrahedron  is 
arbitrary,  since  the  extended  Gaussian  image  is  insensitive 
to  translation.  To  make  the  result  unique,  we  might  place 
the center of mass at the origin. To reduce the size of
the expressions to be manipulated here, however, it is
convenient to move the tetrahedron so that one vertex, D,
say, is at the origin. The distances of the faces from the
center of mass are obviously not affected by this.


Suppose  for  now  that  we  know  the  locations  of  the 
vertices  A,  B,  and  C  relative  to  D.  We  can  then  compute 
the  directions  of  the  six  edges  of  the  tetrahedron,  by  taking 
all  of  the  distinct  pairwise  differences  of  the  four  vertex 
positions.  Four  surface  normals  can  then  be  found  by 
taking  cross-products  of  these  edge-direction  vectors.  We 
actually  need  only  four  of  the  edge  vectors  forming  a  closed 
circuit  to  do  this.  The  results  can  then  be  normalized  to 
obtain  unit  surface  normals, 


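$$\mathbf{a} = \frac{\mathbf{B} \times \mathbf{C}}{|\mathbf{B} \times \mathbf{C}|}, \quad
\mathbf{b} = \frac{\mathbf{C} \times \mathbf{A}}{|\mathbf{C} \times \mathbf{A}|}, \quad
\mathbf{c} = \frac{\mathbf{A} \times \mathbf{B}}{|\mathbf{A} \times \mathbf{B}|},$$

and,

$$\mathbf{d} = \frac{(\mathbf{A} - \mathbf{C}) \times (\mathbf{B} - \mathbf{A})}{|(\mathbf{A} - \mathbf{C}) \times (\mathbf{B} - \mathbf{A})|}
= \frac{\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}}{|\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}|}.$$

Now the perpendicular distance of the plane with area $D$
from the origin can be found by taking the dot-product
of any of the three vertices, $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ with the unit
normal $\mathbf{d}$. Thus,

$$4d = \mathbf{d} \cdot \mathbf{A} = \mathbf{d} \cdot \mathbf{B} = \mathbf{d} \cdot \mathbf{C}
= \frac{[\mathbf{A}\,\mathbf{B}\,\mathbf{C}]}{|\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}|}.$$

The area of the face opposite the origin is also easy to
compute,

$$D = \tfrac{1}{2}\,|(\mathbf{A} - \mathbf{C}) \times (\mathbf{B} - \mathbf{A})|
= \tfrac{1}{2}\,|\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}|.$$

Our task is to express the offset $d$ in terms of the area
$D$ and the given unit surface normals. The two formulae
above do not allow us to do that directly, because we do
not know what the value of $[\mathbf{A}\,\mathbf{B}\,\mathbf{C}]$ is. This quantity, by
the way, is six times the volume, $V$, of the tetrahedron,
or,

$$V = \tfrac{1}{3}\,(4d)\,D = \tfrac{1}{6}\,[\mathbf{A}\,\mathbf{B}\,\mathbf{C}].$$

We proceed by considering the four distinct triple products
of the four unit surface normals. First of all,

$$[\mathbf{a}\,\mathbf{b}\,\mathbf{c}] = \frac{[\mathbf{A}\,\mathbf{B}\,\mathbf{C}]^2}{|\mathbf{A} \times \mathbf{B}|\,|\mathbf{B} \times \mathbf{C}|\,|\mathbf{C} \times \mathbf{A}|},$$

since $[(\mathbf{x} \times \mathbf{y})\,(\mathbf{y} \times \mathbf{z})\,(\mathbf{z} \times \mathbf{x})] = [\mathbf{x}\,\mathbf{y}\,\mathbf{z}]^2$. Then, by similar
reasoning,

$$[\mathbf{a}\,\mathbf{b}\,\mathbf{d}] = \frac{[\mathbf{A}\,\mathbf{B}\,\mathbf{C}]^2}{|\mathbf{B} \times \mathbf{C}|\,|\mathbf{C} \times \mathbf{A}|\,|\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}|},$$

since $[\mathbf{x}\,\mathbf{y}\,(\mathbf{x} + \mathbf{y} + \mathbf{z})] = [\mathbf{x}\,\mathbf{y}\,\mathbf{z}]$. Formulae for $[\mathbf{b}\,\mathbf{c}\,\mathbf{d}]$ and
$[\mathbf{c}\,\mathbf{a}\,\mathbf{d}]$ can be found by cyclical permutation of the variables.

Multiplying the three formulae found this way together
we get,

$$[\mathbf{a}\,\mathbf{b}\,\mathbf{d}]\,[\mathbf{b}\,\mathbf{c}\,\mathbf{d}]\,[\mathbf{c}\,\mathbf{a}\,\mathbf{d}]
= \frac{[\mathbf{A}\,\mathbf{B}\,\mathbf{C}]^6}{|\mathbf{A} \times \mathbf{B}|^2\,|\mathbf{B} \times \mathbf{C}|^2\,|\mathbf{C} \times \mathbf{A}|^2\,|\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}|^3},$$

and so

$$\frac{[\mathbf{a}\,\mathbf{b}\,\mathbf{d}]\,[\mathbf{b}\,\mathbf{c}\,\mathbf{d}]\,[\mathbf{c}\,\mathbf{a}\,\mathbf{d}]}{[\mathbf{a}\,\mathbf{b}\,\mathbf{c}]^2}
= \frac{[\mathbf{A}\,\mathbf{B}\,\mathbf{C}]^2}{|\mathbf{A} \times \mathbf{B} + \mathbf{B} \times \mathbf{C} + \mathbf{C} \times \mathbf{A}|^3}
= \frac{(4d)^2}{2D},$$

so that finally,

$$4d = \frac{\sqrt{(2D)\,[\mathbf{a}\,\mathbf{b}\,\mathbf{d}]\,[\mathbf{b}\,\mathbf{c}\,\mathbf{d}]\,[\mathbf{c}\,\mathbf{a}\,\mathbf{d}]}}{[\mathbf{a}\,\mathbf{b}\,\mathbf{c}]}.$$

The other distances, $a$, $b$, and $c$, can be computed using
similar formulae obtained by cyclical permutation of the
variables.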
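
As a numerical sanity check, the triple-product expression for the offset can be compared with the distance computed directly from the vertices. The sketch below is an illustration under stated assumptions: the tetrahedron is arbitrary (vertex D at the origin), the unit normals are formed from the cross products of the vertex vectors as in the derivation, and absolute values are taken of the triple products so that the result does not depend on the sign convention chosen for the normals. The function name `face_offset_from_normals` is hypothetical.

```python
import math


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]


def norm(p):
    return math.sqrt(dot(p, p))


def unit(p):
    n = norm(p)
    return (p[0] / n, p[1] / n, p[2] / n)


def triple(p, q, r):
    """Scalar triple product [p q r] = p . (q x r)."""
    return dot(p, cross(q, r))


def face_offset_from_normals(a, b, c, d, area_d):
    """Offset d of the face with unit normal d and area area_d from the
    center of mass, using only the four unit normals and that one area."""
    num = 2.0 * area_d * triple(a, b, d) * triple(b, c, d) * triple(c, a, d)
    four_d = math.sqrt(abs(num)) / abs(triple(a, b, c))
    return four_d / 4.0


# Arbitrary tetrahedron with vertex D at the origin.
A, B, C = (2.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.5, 0.5, 4.0)
S = tuple(x + y + z for x, y, z in zip(cross(A, B), cross(B, C), cross(C, A)))
a, b, c, d = unit(cross(B, C)), unit(cross(C, A)), unit(cross(A, B)), unit(S)
area_d = 0.5 * norm(S)

d_formula = face_offset_from_normals(a, b, c, d, area_d)
d_direct = abs(dot(d, A)) / 4.0  # a quarter of the distance from vertex D
```

The two values `d_formula` and `d_direct` should agree to within rounding error.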


3. Continuous Case: Smoothly Curved Objects

The  ideas  presented  in  the  previous  chapter  can  be 
extended  to  apply  to  smoothly  curved  surfaces. 

3.1.  Gaussian  Image 

One  can  associate  a  point  on  the  Gaussian  sphere 
with  a  given  point  on  a  surface  by  finding  the  point  on 
the  sphere  which  has  the  same  surface  normal  (Figure 
7) [20, 22, 25]. Thus it is possible to map information
associated  with  points  on  the  surface  onto  points  on  the 
Gaussian  sphere.  In  the  case  of  a  convex  object  with 
positive  Gaussian  curvature  everywhere,  no  two  points 
have  the  same  surface  normal.  The  mapping  from  the 
object  to  the  Gaussian  sphere  in  this  case  is  invertible: 
Corresponding  to  each  point  on  the  Gaussian  sphere  there 
is  a  unique  point  on  the  surface  (If  the  convex  surface  has 
patches  with  zero  Gaussian  curvature,  curves  or  even  areas 
on  it  may  correspond  to  a  single  point  on  the  Gaussian 
sphere). 


Figure  7.  The  Gaussian  image  of  an  object  is  obtained 
by  associating  with  each  point  on  its  surface  the  point  on  the 
Gaussian  sphere  which  has  the  same  surface  orientation. 
The  mapping  is  invertible  if  the  object  has  positive  Gaussian 
curvature  everywhere. 


One  useful  property  of  the  Gaussian  image  is  that 
it  rotates  with  the  object.  Consider  two  parallel  surface 
normals,  one  on  the  object  and  the  other  on  the  Gaussian 
sphere.  The  two  normals  will  remain  parallel  if  the  object 
and  the  Gaussian  sphere  are  rotated  in  the  same  fashion. 
A  rotation  of  the  object  thus  corresponds  to  an  equal 
rotation  of  the  Gaussian  sphere. 
3.2. Gaussian Curvature

Consider a small patch $\delta O$ on the object. Each point
in this patch corresponds to a particular point on the
Gaussian sphere. The patch $\delta O$ on the object maps into
a patch, $\delta S$ say, on the Gaussian sphere (Figure 8). If
the surface is strongly curved, the normals of points in
the patch will point into a wide fan of directions. The
corresponding points on the Gaussian sphere will be spread
out. Conversely, if the surface is planar, the surface normals
are parallel and map into a single point.

Figure 8. A patch on the object maps into a patch on
the Gaussian sphere. The Gaussian curvature is the limit
of the ratio of the area of the patch on the Gaussian sphere
to the area of the patch on the object as these become
smaller and smaller.

These considerations suggest a suitable definition of
curvature. The Gaussian curvature is defined to be equal
to the limit of the ratio of the two areas as they tend to
zero. That is,

$$K = \lim_{\delta O \to 0} \frac{\delta S}{\delta O} = \frac{dS}{dO}.$$

From this differential relationship we can obtain two useful
integrals. Consider first integrating $K$ over a finite patch
$O$ on the object:

$$\iint_O K \, dO = \iint_S dS = S,$$

where $S$ is the area of the corresponding patch on the
Gaussian sphere. The expression on the left is called the
integral curvature. This relationship allows one to deal
with surfaces which have discontinuities in surface normal.

Now consider instead integrating $1/K$ over a patch $S$
on the Gaussian sphere:

$$\iint_S \frac{1}{K} \, dS = \iint_O dO = O,$$

where $O$ is the area of the corresponding patch on the
object. This relationship suggests the use of the inverse of
the Gaussian curvature in the definition of the extended
Gaussian image of a smoothly curved object, as we shall
see. It also shows, by the way, that the integral of $1/K$
over the whole Gaussian sphere equals the total area of
the object.


Consider a plane which includes the surface normal at
some point on a smooth surface. The surface cuts this plane
along a curve called a normal section (Figure 9) [19, 22-25].
Let the curvature of the normal section be denoted by $\kappa_N$.
Consider the one-parameter family of planes containing
the surface normal. Suppose that $\theta$ is the angle between
a particular plane and a given reference plane. Then $\kappa_N$
varies with $\theta$ in a periodic fashion. In fact, if we measure
$\theta$ from the plane that gives maximum curvature, then it
can be shown that

$$\kappa_N(\theta) = \kappa_1 \cos^2\theta + \kappa_2 \sin^2\theta,$$

where $\kappa_1$ is the maximum and $\kappa_2$ is the minimum curvature.
These two values of $\kappa$ are called the principal curvatures.
The corresponding planes are called the principal planes.
The two principal planes are orthogonal, provided that the
principal curvatures are distinct (Figure 9).


Figure 9. Normal sections of the surface are made
with  planes  which  include  the  surface  normal.  The  planes 
corresponding  to  the  largest  and  smallest  value  of  curvature 
are  referred  to  as  the  principal  planes.  The  Gaussian 
curvature  is  equal  to  the  product  of  the  largest  and  smallest 
values  of  curvature. 


It  turns  out  that 

$$K = \kappa_1 \kappa_2$$

is equal to the Gaussian curvature introduced earlier. This
is clearly zero for a plane. It is equal to $1/R^2$ for a spherical
surface of radius $R$, since the curvature of any normal
section is $1/R$.
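
A short numerical illustration of these two relations follows. It is a sketch only; the function names `normal_curvature` and `gaussian_curvature` are hypothetical, and the principal curvatures are supplied by hand rather than derived from a surface description.

```python
import math


def normal_curvature(kappa1, kappa2, theta):
    """Curvature of the normal section at angle theta from the plane of
    maximum curvature (theta in radians)."""
    return kappa1 * math.cos(theta) ** 2 + kappa2 * math.sin(theta) ** 2


def gaussian_curvature(kappa1, kappa2):
    """Product of the two principal curvatures."""
    return kappa1 * kappa2


R = 2.0
# Sphere: every normal section is a circle of radius R.
sphere_K = gaussian_curvature(1 / R, 1 / R)
# Circular cylinder: curvature 1/R around the axis, zero along it.
cylinder_K = gaussian_curvature(1 / R, 0.0)
# Normal section of the cylinder half way between the principal planes.
oblique = normal_curvature(1 / R, 0.0, math.pi / 4)
```

The cylinder example shows how a surface can be visibly curved and still have zero Gaussian curvature, because one principal curvature vanishes.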

A  ruled  surface  is  one  which  can  be  generated  by 
sweeping  a  line  through  space.  A  hyperboloid  provides  one 
example  of  such  a  surface.  Developable  surfaces  are  special 
cases of ruled surfaces [22, 23, 25]. Cylindrical and conical
surfaces are examples of developable surfaces (Figure 10).
On  a  developable  surface  at  least  one  of  the  two  principal 
curvatures  is  zero  at  all  points.  Consequently  the  Gaussian 
curvature  is  zero  everywhere  too. 


3.3.  Alternate  Definition  of  Gaussian  Curvature 


Figure  10.  A  conical  surface  is  an  example  of  a 
developable  surface.  On  it  the  Gaussian  curvature  is 
everywhere zero, because (at least) one of the principal
curvatures  is  zero. 

3.4.  The  Extended  Gaussian  Image 

We  can  define  a  mapping  which  associates  the  inverse 
of  the  Gaussian  curvature  at  a  point  on  the  surface  of 
the  object  with  the  corresponding  point  on  the  Gaussian 
sphere. Let $u$ and $v$ be parameters used to identify points
on the original surface. Similarly, let $\xi$ and $\eta$ be parameters
used to identify points on the Gaussian sphere (these could
be longitude and latitude, for example). Then we define
the extended Gaussian image as

$$G(\xi, \eta) = \frac{1}{K(u, v)},$$

where $(\xi, \eta)$ is the point on the Gaussian sphere which
has the same normal as the point $(u, v)$ on the original
surface. It can be shown that this mapping is unique (up
to translation) for convex objects. That is, there is only
one convex object corresponding to a particular extended
Gaussian image [9, 19, 26]. The proof is unfortunately
non-constructive and no direct method for recovering the
object is known.

3.5.  Properties  of  the  Extended  Gaussian  Image 

The center of mass of the extended Gaussian image
of a smoothly curved object is at the origin. We show this
in a way similar to that used earlier for extended Gaussian
images of polyhedral objects. Consider viewing a convex
object from a great distance. Let the direction from the
object towards the viewer be given by the unit vector $\mathbf{v}$.
A surface patch with unit normal $\mathbf{s}$ will be visible only if
$\mathbf{s} \cdot \mathbf{v} > 0$. Suppose its surface area is $\delta O$ (Figure 4). Due
to foreshortening it will appear only as large as would a
patch of area

$$(\mathbf{s} \cdot \mathbf{v})\, \delta O,$$

normal to $\mathbf{v}$. Let $H(\mathbf{v})$ be the unit hemisphere for which
$\mathbf{s} \cdot \mathbf{v} > 0$. Then the apparent area of the visible surface is

$$A(\mathbf{v}) = \iint_{H(\mathbf{v})} G(\mathbf{s})\, (\mathbf{s} \cdot \mathbf{v}) \, dS,$$

when viewed from the direction $\mathbf{v}$. The apparent area of
the visible surface when viewed from the opposite direction
is

$$A(-\mathbf{v}) = \iint_{H(-\mathbf{v})} G(\mathbf{s})\, \bigl(\mathbf{s} \cdot (-\mathbf{v})\bigr) \, dS.$$

This should be the same, that is, $A(\mathbf{v}) = A(-\mathbf{v})$.
Consequently,

$$\iint_S G(\mathbf{s})\, (\mathbf{s} \cdot \mathbf{v}) \, dS = \Bigl[\iint_S G(\mathbf{s})\, \mathbf{s} \, dS\Bigr] \cdot \mathbf{v} = 0,$$

where the integral now is over the whole sphere $S$. This
holds true for all view vectors, $\mathbf{v}$, so we must have

$$\iint_S G(\mathbf{s})\, \mathbf{s} \, dS = 0.$$

That  is,  the  center  of  mass  of  the  extended  Gaussian  image 
is  at  the  origin  (This,  by  the  way,  is  not  a  very  helpful 
constraint  in  practice,  since  one  usually  only  sees  one  side 
of  the  object). 

Another  property  of  the  extended  Gaussian  image  is 
also  easily  demonstrated.  The  total  mass  of  the  extended 
Gaussian  image  equals  the  total  surface  area  of  the  object. 
If  one  wishes  to  deal  with  objects  of  the  same  shape  but 
differing  size  one  may  normalize  the  extended  Gaussian 
image  by  dividing  by  the  total  mass. 

We  can  think  of  the  extended  Gaussian  image  in  terms 
of  mass  density  on  the  Gaussian  sphere.  It  is  possible  then 
to  deal  in  a  consistent  way  with  places  on  the  surface 
where  the  Gaussian  curvature  is  zero,  using  the  integral 
of $1/K$ shown earlier. A planar region, for example,
corresponds  to  a  point  mass.  This  in  turn  corresponds  to  an 
impulse  function  on  the  Gaussian  sphere  with  magnitude 
proportional  to  the  area  of  the  planar  region. 

A  mass  distribution  has  inertia  about  an  axis  passing 
through  its  center  of  mass  that  depends  on  the  direction  of 
the  axis.  This  inertia  takes  on  three  stationary  values,  for 
three  particular  orthogonal  directions,  called  the  principal 
axes  of  the  object.  It  is  tempting  to  imagine  that  one  can 
find the attitude of an object by lining up the principal axes
of  inertia  of  the  observed  extended  Gaussian  image  and  the 
one  computed  from  the  geometric  model  [9].  This  would 
be  rather  straightforward,  requiring  only  the  calculation 
of  the  eigenvectors  of  a  three  by  three  inertia  matrix. 
In  practice,  one  typically  has  information  only  about  the 
visible  hemisphere  and  thus  cannot  compute  the  required 
first and second moments over the whole sphere.

3.6.  Objects  that  are  not  Convex 

Three  things  happen  when  the  surface  is  non-convex: 

1.  The  Gaussian  curvature  for  some  points  will  be 
negative. 

2.  More  than  one  point  on  the  object  will  contribute  to 
a  given  point  on  the  Gaussian  sphere. 

3. Parts of the object may be obscured by other parts.

We  chose  to  extend  the  definition  of  the  extended  Gaussian 
image  in  this  case  to  be  the  sum  of  the  absolute  values  of 
the  inverses  of  the  Gaussian  curvature  at  all  points  having 
the  same  surface  orientation, 

$$G(\xi, \eta) = \sum_i \frac{1}{|K(u_i, v_i)|}.$$

This  definition  is  motivated  by  the  method  used  to  compute 
the  extended  Gaussian  image  in  the  discrete  case,  as  we 
will  see  later. 

The above extension makes sense if there are a finite,
or at most a countable, number of points on the surface
with the same orientation. At times, however, all points
on a curve or even an area on the surface have parallel
surface normals. In this case we may use,

$$G(\hat{\mathbf{n}}) = \iint_O \delta(\hat{\mathbf{n}} - \hat{\mathbf{s}}) \, dO,$$

where $\hat{\mathbf{n}}$ is a unit vector on the Gaussian sphere, while $\hat{\mathbf{s}}$ is
a unit vector on the surface of the object. The integration
is over the whole surface of the object $O$ and $\delta$ is the unit
impulse function defined on a sphere.

We can be more specific. If we let $\mathbf{r}(u, v)$ be a vector
giving the point on the surface corresponding to the
parameters $u$ and $v$, then

$$G(\xi, \eta) \cos\eta = \iint \delta\bigl(\xi - \theta(u, v),\ \eta - \phi(u, v)\bigr)\, |\mathbf{r}_u \times \mathbf{r}_v| \, du \, dv,$$

where $\theta(u, v)$ and $\phi(u, v)$ are the longitude and latitude
of the point on the Gaussian sphere which has the same
orientation as the surface does at the point $(u, v)$. A planar
region of area $A$ will thus contribute an impulse of weight
$A$ to the extended Gaussian image, while a cylindrical
region will give rise to an impulse wall along a great circle
at right angles to the axis of the cylinder. The integral of
the impulse wall will be equal to the area of the cylindrical
region.

Usually  we  think  of  the  extended  Gaussian  image  as 
a  fixed  entity  associated  with  an  object.  In  the  case  of 
non-convex  objects  we  might  want  to  alter  the  definition 
to  include  only  those  parts  of  the  surface  visible  from 
a  particular  direction.  This  would  make  the  (modified) 
Gaussian  image  dependent  on  the  view-point.  We  avoid 
this potential complication here.

3.7.  Examples  of  Extended  Gaussian  Images  (*) 

The  extended  Gaussian  image  of  a  sphere  of  radius  R 
is 

$$G(\xi, \eta) = R^2,$$

as  discussed  already. 

Perhaps slightly more interesting is the case of an
ellipsoid with semi-axes $a$, $b$ and $c$ lined up with the
coordinate axes (Figure 11). An equation for its surface
can be written

$$\left(\frac{x}{a}\right)^2 + \left(\frac{y}{b}\right)^2 + \left(\frac{z}{c}\right)^2 = 1.$$

More useful for our purposes here is a parametric form

$$x = a \cos\theta \cos\phi, \quad y = b \sin\theta \cos\phi, \quad z = c \sin\phi.$$

Figure 11. Ellipsoid with contours obtained by cutting
the surface with three orthogonal planes passing through
pairs of points where the Gaussian curvature has stationary
values.

A normal at the point

$$\mathbf{r} = (a \cos\theta \cos\phi,\ b \sin\theta \cos\phi,\ c \sin\phi)^T$$

on the surface is given by

$$\mathbf{n} = (bc \cos\theta \cos\phi,\ ca \sin\theta \cos\phi,\ ab \sin\phi)^T,$$

as will be shown later. The Gaussian curvature turns out
to be equal to

$$K = \left(\frac{abc}{(bc \cos\theta \cos\phi)^2 + (ca \sin\theta \cos\phi)^2 + (ab \sin\phi)^2}\right)^2 = \left(\frac{abc}{n^2}\right)^2,$$

where $n^2 = \mathbf{n} \cdot \mathbf{n}$.

Figure  12.  Latitude  and  longitude  can  be  used  to  identify 
points on the Gaussian sphere. Each point on the Gaussian
sphere  corresponds  to  a  unique  surface  orientation. 

If we let $\xi$ be the longitude and $\eta$ be the latitude on
the Gaussian sphere, then a unit normal at the point $(\xi, \eta)$
on the sphere is given by (Figure 12)

$$\hat{\mathbf{n}} = (\cos\xi \cos\eta,\ \sin\xi \cos\eta,\ \sin\eta)^T.$$

Now $\mathbf{n} = n \hat{\mathbf{n}}$. Identifying terms in the two expressions for
surface normals at corresponding points on the ellipsoid
and the Gaussian sphere we get,

$$bc \cos\theta \cos\phi = n \cos\xi \cos\eta,$$
$$ca \sin\theta \cos\phi = n \sin\xi \cos\eta,$$
$$ab \sin\phi = n \sin\eta,$$

so that

$$n^2 \left[(a \cos\xi \cos\eta)^2 + (b \sin\xi \cos\eta)^2 + (c \sin\eta)^2\right] = (abc)^2,$$

and finally, substituting for $n^2$ in the equation for $K$ we
get

$$G(\xi, \eta) = \frac{1}{K} = \left(\frac{abc}{(a \cos\xi \cos\eta)^2 + (b \sin\xi \cos\eta)^2 + (c \sin\eta)^2}\right)^2.$$

The extended Gaussian image, in this case, varies smoothly
and has the stationary values

$$\left(\frac{bc}{a}\right)^2, \quad \left(\frac{ca}{b}\right)^2, \quad \left(\frac{ab}{c}\right)^2,$$

at the points where $\hat{\mathbf{n}}$ is equal to $(\pm 1, 0, 0)^T$, $(0, \pm 1, 0)^T$,
and $(0, 0, \pm 1)^T$, respectively. These results can be easily
checked by sectioning the ellipsoid using the xy, yz and
zx-planes. The Gaussian curvature, in this case, equals
the product of the curvatures of the resulting ellipses.
One then uses the fact that the maximum and minimum
curvatures of an ellipse with semi-axes $a$ and $b$ are $a/b^2$
and $b/a^2$.
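
The closed form for the ellipsoid lends itself to a quick numerical check. The sketch below evaluates it at the three axis directions and also integrates it over the Gaussian sphere, which should give the surface area of the object, as noted earlier for the integral of 1/K. The function names and the simple midpoint rule on a longitude-latitude grid are choices made for this illustration only.

```python
import math


def ellipsoid_egi(a, b, c, xi, eta):
    """Extended Gaussian image of an axis-aligned ellipsoid at longitude xi
    and latitude eta (radians) on the Gaussian sphere."""
    denom = ((a * math.cos(xi) * math.cos(eta)) ** 2
             + (b * math.sin(xi) * math.cos(eta)) ** 2
             + (c * math.sin(eta)) ** 2)
    return (a * b * c / denom) ** 2


def total_mass(a, b, c, n=200):
    """Midpoint-rule integral of G over the whole sphere; the area element
    in longitude and latitude is cos(eta) d(xi) d(eta)."""
    d_xi, d_eta = 2 * math.pi / n, math.pi / n
    total = 0.0
    for i in range(n):
        xi = -math.pi + (i + 0.5) * d_xi
        for j in range(n):
            eta = -math.pi / 2 + (j + 0.5) * d_eta
            total += ellipsoid_egi(a, b, c, xi, eta) * math.cos(eta) * d_xi * d_eta
    return total


a, b, c = 3.0, 2.0, 1.0
g_x = ellipsoid_egi(a, b, c, 0.0, 0.0)           # normal along the x-axis
g_y = ellipsoid_egi(a, b, c, math.pi / 2, 0.0)   # normal along the y-axis
g_z = ellipsoid_egi(a, b, c, 0.0, math.pi / 2)   # normal along the z-axis
sphere_area = total_mass(2.0, 2.0, 2.0)          # sphere of radius 2
```

The three sampled values should match the stationary values given above, and `sphere_area` should approximate the area of a sphere of radius 2.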

Later  we  will  derive  the  extended  Gaussian  image  of 
a  torus,  an  object  that  is  not  convex. 


4. Discrete Approximation: Needle Maps

Consider the surface broken up into small patches of
equal area. Let there be $\rho$ patches per unit area. Erect
a surface normal on each patch. Consider the polyhedral
object formed by the intersection of the tangent planes
perpendicular to these surface normals. It approximates
the original surface. The smaller the patches, the better
the approximation.

The extended Gaussian image of the original (smoothly
curved) convex object is approximated by impulses
corresponding to the small patches. The magnitude of each
impulse is about $1/\rho$, corresponding to the area of the
patch it rests on (Figure 13). Strongly curved areas will
distribute their impulses over a large region on the Gaussian
sphere, while areas which are nearly planar will have them
concentrated in a small region. In fact, the number of
impulses per unit area on the Gaussian sphere approaches
$\rho$ times the absolute value of the inverse of the Gaussian
curvature as we make $\rho$ larger and larger. This can be
shown using the integral of $1/K$ given earlier.

The  tesselation  of  the  surface  can  be  based  on  an 
arbitrary  division  into  triangular  patches  as  long  as  the 
magnitude  of  each  impulse  on  the  Gaussian  sphere  is  made 
proportional  to  the  area  of  the  corresponding  patch  on 
the  surface.  Alternatively,  one  can  divide  the  surface  up 
according  to  the  division  of  the  image  into  picture  cells. 
In  this  case  one  has  to  take  into  account  that  the  area 
occupied  in  the  image  by  a  given  patch  is  affected  by 
foreshortening. The actual surface area is proportional to
$1/(\hat{\mathbf{s}}_i \cdot \hat{\mathbf{v}})$, where $\hat{\mathbf{s}}_i$ is the normal of the patch, while $\hat{\mathbf{v}}$ is
the vector pointing towards the viewer (Figure 4).

Figure 13. Mapping of discrete patches on an object
onto the Gaussian sphere. The patches in this case
correspond to a regular tesselation of the image plane.
Since the patches lie on a conical surface they contribute
to the extended Gaussian image along a small circle.

Measurements  of  surface  orientation  from  images  will 
not  be  perfect,  since  they  are  affected  by  the  noise  in 
brightness  measurements.  Similarly,  surface  orientations 
obtained  from  range  data  will  be  somewhat  inaccurate. 
Consequently  the  impulses  on  the  Gaussian  sphere  will  be 
displaced  a  little  from  their  true  positions.  The  expected 
density  on  the  Gaussian  sphere  will  nevertheless  tend  to 
be  equal  to  the  inverse  of  the  Gaussian  curvature.  One 
cannot,  however,  expect  the  impulses  corresponding  to  a 
planar  surface  to  be  coincident.  Instead,  they  will  tend  to 
form  a  small  cluster.  To  be  precise,  the  effect  of  noise  is 
to  smear  out  the  information  on  the  sphere.  The  extended 
Gaussian  image  is  convolved  with  a  smoothing  function  of 
width  proportional  to  the  magnitude  of  the  noise. 

4.1.  Using  Object  Models 

Extended  Gaussian  images  also  have  to  be  computed 
for  surfaces  of  prototypical  object  models.  In  this  case  it  is 
best  to  find  a  convenient  way  to  parameterize  the  surface 
and  break  it  up  into  many  small  patches.  Suppose  the 
surface  is  given  in  terms  of  two  parameters  u  and  v  as 
$\mathbf{r}(u, v)$. Then at the point $(u, v)$ we see that $\mathbf{r}_u$ and $\mathbf{r}_v$
are two tangents (Figure 14). The cross product of these
tangents is normal to the surface. The unit normal

$$\hat{\mathbf{n}} = \frac{\mathbf{r}_u \times \mathbf{r}_v}{|\mathbf{r}_u \times \mathbf{r}_v|}$$

allows us to determine to which point on the Gaussian
sphere this patch corresponds. Suppose that we divided
the range of $u$ into segments of size $\delta u$ and the range of $v$
into segments of size $\delta v$. Then the area of the patch,

$$\delta A = |\mathbf{r}_u \times \mathbf{r}_v|\,\delta u\,\delta v,$$

can  be  used  to  determine  what  contribution  it  makes  to  the 
corresponding  place  on  the  Gaussian  sphere.  Note  that  we 
do  not  have  to  explicitly  compute  the  Gaussian  curvature 
or take second partial derivatives.

Figure 14. A surface normal can be computed by
taking the cross-product of two tangent vectors. The
tangent vectors can be obtained by differentiation of the
parametric form of the equation of the surface.
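
The procedure for object models can be sketched as follows: for each parameter cell, form the two tangents, take their cross product, and record the unit normal together with the patch area. This is a minimal sketch, not the author's implementation; the helper names (`cross`, `patches`, `sphere`), the use of central differences for the tangents, and the test surface are all assumptions made for illustration.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as sequences."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def patches(r, u_range, v_range, nu, nv, h=1e-6):
    """Yield (unit normal, area) for each patch of the surface r(u, v)."""
    (u0, u1), (v0, v1) = u_range, v_range
    du, dv = (u1 - u0) / nu, (v1 - v0) / nv
    for i in range(nu):
        for j in range(nv):
            u, v = u0 + (i + 0.5) * du, v0 + (j + 0.5) * dv
            # Tangents r_u and r_v by central differences (first derivatives only).
            ru = [(p - q) / (2 * h) for p, q in zip(r(u + h, v), r(u - h, v))]
            rv = [(p - q) / (2 * h) for p, q in zip(r(u, v + h), r(u, v - h))]
            n = cross(ru, rv)
            mag = math.sqrt(sum(c * c for c in n))
            # The patch area is |r_u x r_v| du dv.
            yield tuple(c / mag for c in n), mag * du * dv

def sphere(u, v, radius=2.0):
    """Hypothetical test surface: longitude u, latitude v."""
    return (radius * math.cos(u) * math.cos(v),
            radius * math.sin(u) * math.cos(v),
            radius * math.sin(v))

needle_map = list(patches(sphere, (0.0, 2 * math.pi),
                          (-math.pi / 2, math.pi / 2), 64, 32))
total_area = sum(area for _, area in needle_map)
print(len(needle_map), total_area)
```

Summing the patch areas should approximately recover the surface area of the test sphere, which is a convenient sanity check before accumulating the patches on the Gaussian sphere.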

5. Tesselation of the Gaussian Sphere:
Orientation Histograms

It  is  useful  to  divide  the  sphere  up  into  cells  in  order 
to  represent  the  information  on  the  Gaussian  sphere  in 
a  computer.  Ideally  the  cells  should  satisfy  the  following 
criteria: 

1.  All  cells  should  have  the  same  area. 

2.  All  cells  should  have  the  same  shape. 

3. The cells should have regular shapes that are compact.

4. The division should be fine enough to provide good
angular  resolution. 

5.  For  some  rotations  the  cells  should  be  brought  into 
coincidence  with  themselves. 

Cells  which  are  compact  combine  information  only  from 
surface  patches  which  have  nearly  the  same  orientation. 
Elongated cells of the same area combine information
from surface patches which have more widely differing
orientations. The area of a regular polygon with $n$ sides
inscribed in a circle of radius $r$ is

$$\pi r^2\,\frac{\sin(2\pi/n)}{(2\pi/n)}.$$

So the area of a hexagon inscribed in a circle is $(3\sqrt{3}/2)r^2$,
twice that of a triangle inscribed in the same circle.
Tesselations with near-triangular cells will thus combine
information from orientations which are $\sqrt{2}$ times as far
from the average as do tesselations using near-hexagonal
cells.

If  cells  occur  in  a  regular  pattern,  the  relationship  of 
a  cell  to  its  neighbors  will  be  the  same  for  all  cells.  Such 
arrangements are to be preferred. Unfortunately, it is not
possible  to  simultaneously  satisfy  the  criteria  listed  above. 

A simple tesselation consists of a division into latitude
bands, each of which is then further divided along
longitudinal strips (Figure 15). The cells could be made
more nearly equal in area by having fewer at higher
latitudes, or by making the latitude bands wider there,
or both. One advantage of this scheme is that it makes it
easy to compute to which cell a particular surface normal
should be assigned. Still, this arrangement does not come
close to satisfying the criteria stated above. In particular,
the cells are brought into alignment only for a few rotations
about the axis of the globe. Rotations about any other axis
cannot bring the cells into alignment.

Figure 15. The Gaussian sphere can be divided into
cells along meridians and lines of latitude. The resulting
cells do not have the same areas, however, and only align
with each other for certain rotations about the axis through
the poles.
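
The ease of cell assignment in the latitude-longitude scheme can be illustrated with a short sketch. It assumes equal-width latitude bands and the same number of longitudinal strips in every band (the simplest variant, which does not equalize cell areas); the function name `latlong_cell` and its arguments are hypothetical.

```python
import math

def latlong_cell(n, n_lat, n_lon):
    """Return (band, strip) indices for a unit normal n = (x, y, z),
    using n_lat equal-width latitude bands and n_lon longitude strips."""
    x, y, z = n
    eta = math.asin(max(-1.0, min(1.0, z)))     # latitude in [-pi/2, pi/2]
    xi = math.atan2(y, x) % (2.0 * math.pi)      # longitude in [0, 2 pi)
    band = min(int((eta + math.pi / 2) / math.pi * n_lat), n_lat - 1)
    strip = min(int(xi / (2.0 * math.pi) * n_lon), n_lon - 1)
    return band, strip

# Example: a normal at latitude 40 degrees, longitude 100 degrees,
# binned into 30-degree bands and 30-degree strips.
eta, xi = math.radians(40.0), math.radians(100.0)
normal = (math.cos(xi) * math.cos(eta), math.sin(xi) * math.cos(eta), math.sin(eta))
cell = latlong_cell(normal, 6, 12)
print(cell)
```

Only an arcsine, an arctangent and two integer divisions are needed, which is the advantage noted above. The cost is that cells near the poles are much smaller than those near the equator.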

5.1.  Tesselations  Based  on  Regular  Polyhedra 

Better tesselations may be found by projecting regular
polyhedra onto the unit sphere after bringing their center
to the center of the sphere [27]. Regular polyhedra are
uniform and have faces which are all of one kind of regular
polygon (they are also called the Platonic solids) [19, 20,
28-32]. The vertices of a regular polyhedron are congruent.
A division obtained by projecting a regular polyhedron
has the desirable property that the resulting cells all have
the same shape and area. Also, all cells have the same
geometric relationship to their neighbors. In the case of
the dodecahedron, the cells are even fairly well rounded.
The dodecahedron, however, has only twelve cells (Figure
16a). Even the icosahedron, with twenty triangular cells,
provides too coarse a sampling of orientations (Figure 16b).
Furthermore, its cells are not well rounded. Unfortunately
there are only five regular solids (tetrahedron, hexahedron,
octahedron, dodecahedron, and icosahedron).


Figure 16. Tesselations of the Gaussian sphere using (a)
the regular dodecahedron and (b) the regular icosahedron.

One can go a little further by considering semi-regular
polyhedra. A semi-regular polyhedron has regular polygons
as faces, but the faces are not all of the same kind [19, 20,
28-32] (they are also called the Archimedean polyhedra).
As for regular polyhedra, the vertices are congruent. There
can be either two or three different types of faces and these
have different areas. An illustration of a tesselation using
a semi-regular polyhedron is provided by a soccer ball
(Figure 17a). It is based on the truncated icosahedron,
a semi-regular polyhedron which has 12 pentagonal faces
and 20 hexagonal faces.

Unfortunately there are only 13 semi-regular polyhedra
(the five truncated regular polyhedra, cuboctahedron,
icosidodecahedron, snub cuboctahedron, snub
icosidodecahedron, truncated cuboctahedron, rhombicuboctahedron,
truncated icosidodecahedron, and the
rhombicosidodecahedron). Overall, these objects do not provide us with fine
enough tesselations. The snub icosidodecahedron has the
largest number of faces, but each of its 80 triangles is
much smaller than each of its 12 pentagons.

The edges of a semi-regular polyhedron are all the
same length. One consequence of this is that the different
types of faces have different areas. The area of a regular
polygon of $n$ sides and edge-length $e$ equals

$$\frac{n\,e^2}{4\tan(\pi/n)},$$

so it is very roughly proportional to $n^2$. This is a problem
generally with semi-regular polyhedra. It is sometimes
possible to derive a new polyhedron which has the same
adjacency relationships between faces as a given
semi-regular polyhedron but also has faces of equal area. The
shapes of some of these faces then are no longer regular,
however.


Figure 17. Tesselation of the Gaussian sphere using
(a) the truncated icosahedron and (b) the pentakis
dodecahedron.

If we desire a finer subdivision still, we can consider
splitting each face of a given tesselation further into
triangular facets. If, for example, we split each pentagonal
face of a dodecahedron into five equal triangles we obtain
a pentakis dodecahedron with 60 faces (Figure 17b). This
happens to be the dual of the truncated icosahedron,
discussed above. If we apply this method instead to the
truncated icosahedron we construct an object with 180
faces. This object, as well as the pentakis dodecahedron,
forms a suitable basis for further subdivision, as we shall
show later.

To see how fine a division we might need, let us
calculate the angular spread of surface normals which map
into a particular cell. If there are $n$ equal cells, then each
one will have area

$$A = 4\pi/n,$$

since the total area of the unit sphere is $4\pi$ (this area
equals the solid angle of the cone formed by the cell when
connected to the center of the sphere). The shape which
minimizes the angular spread for given surface area is the
circular disc. The area of a circular disc on the unit sphere
is

$$A = 2\pi(1 - \cos\theta),$$

where $\theta$ is the half-angle of the cone formed by the disc
when connected to the center of the sphere. If $\theta$ is small,
the area can be approximated by

$$A \approx \pi\theta^2.$$

Thus if there are many cells and if they could be made
circular, the angular spread would be

$$\theta \approx 2/\sqrt{n}.$$

The best we can hope for, however, are near-hexagonal
cells. The area of a hexagon inscribed in a circle of radius $r$
is $(3\sqrt{3}/2)r^2$, as already mentioned. The area of the circle,
$\pi r^2$, is about 20% more. So a hexagonal shape has a spread
which is

$$\sqrt{\frac{2\pi}{3\sqrt{3}}} = 1.0996\ldots$$

as large as that of a circular shape of equal area. A lower
bound on the angular spread for a tesselation with $n$ cells
then is

$$\theta \geq \sqrt{\frac{8\pi}{3\sqrt{3}\,n}} = \frac{2.1993\ldots}{\sqrt{n}}.$$

For $n = 60$, for example, the spread is greater than
16.2°. One should also remember that the spread for
triangular cells is even more, namely $\sqrt{2}$ times that for
hexagonal cells.
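
The bound is easy to tabulate. The sketch below evaluates $\sqrt{8\pi/(3\sqrt{3}\,n)}$ for a few cell counts and also prints the value scaled by $\sqrt{2}$ for triangular cells; the function name `spread_lower_bound` and the chosen values of `n` are assumptions made for illustration.

```python
import math

def spread_lower_bound(n):
    """Lower bound (radians) on the angular spread of a tesselation
    with n equal-area, near-hexagonal cells on the unit sphere."""
    return math.sqrt(8.0 * math.pi / (3.0 * math.sqrt(3.0) * n))

for n in (20, 60, 240):
    hexagonal = math.degrees(spread_lower_bound(n))
    triangular = math.sqrt(2.0) * hexagonal
    print(f"n = {n:3d}: hexagonal >= {hexagonal:.2f} deg, "
          f"triangular about {triangular:.2f} deg")
```

For 240 triangular cells the scaled value comes out near 11.5°, which matches the figure quoted later for the frequency two subdivision of the pentakis dodecahedron.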

5.2.  Geodesic  Domes 

To proceed further, we can divide the triangular cells
into four smaller triangles according to the well known
geodesic dome constructions [27, 30, 33]. We attain high
resolution by relenting on several of the criteria given above
(Figure 18). Specifically, the cells of a geodesic tesselation
do not all have the same area and shape. The cells are
also not compact, being shaped like (irregular) triangles.
The duals of geodesic domes are better in this respect,
since they have facets that are mostly (irregular) hexagons,
with a dozen (regular) pentagons thrown in. Tesselations
of arbitrary fineness can be constructed in this fashion.
The pentakis dodecahedron is a good starting point for
a geodesic division, as is the object constructed earlier
from the truncated icosahedron by dividing the faces into
triangles.

Each of the edges of the triangular cells of the original
polyhedron is divided into $f$ sections, where $f$ is called the
frequency of the geodesic division. The result is that each
face is divided into $f^2$ (irregular) triangles. Tesselations
where the frequency is a power of two are particularly well
suited to the method suggested here, as we see next.

One  has  to  be  able  to  efficiently  compute  to  which 
cell  a  particular  surface  normal  belongs.  In  the  case  of 
the tesselations derived from regular polyhedra, one first
computes  the  dot-product  of  the  given  unit  vector  and  the 
vector  to  the  center  of  each  cell  (These  reference  vectors 
correspond  to  the  vertices  of  the  dual  of  the  original  regular 
polyhedron).  This  gives  one  the  cosine  of  the  angle  between 
the  two.  The  closest  reference  vector  is  the  one  which  gives 
the  largest  dot-product.  The  given  vector  is  then  assigned 
to  the  cell  corresponding  to  that  reference  vector. 
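
The dot-product assignment can be sketched for the dodecahedral tesselation, whose twelve cell centers are the vertices of the dual icosahedron. This is an illustrative sketch only; the function names and the particular vertex coordinates, the standard $(0, \pm 1, \pm\varphi)$ family with $\varphi$ the golden ratio, are assumptions rather than anything specified in the text.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0   # golden ratio

def dodecahedron_cell_centers():
    """Unit vectors to the 12 face centers of a regular dodecahedron,
    i.e. the vertices of its dual, the icosahedron."""
    s = math.sqrt(1.0 + PHI * PHI)
    centers = []
    for a in (-1.0, 1.0):
        for b in (-PHI, PHI):
            centers += [(0.0, a / s, b / s),
                        (a / s, b / s, 0.0),
                        (b / s, 0.0, a / s)]
    return centers

def assign_cell(n, centers):
    """Index of the reference vector with the largest dot-product with n."""
    dots = [n[0] * c[0] + n[1] * c[1] + n[2] * c[2] for c in centers]
    return max(range(len(centers)), key=dots.__getitem__)

centers = dodecahedron_cell_centers()
index = assign_cell((0.0, 0.1, 0.995), centers)
print(index, centers[index])
```

A normal lying exactly on a cell boundary produces a tie between two dot-products; here `max` simply returns the first such cell.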


Figure 18. Tesselation of the Gaussian sphere using
a frequency four geodesic tesselation based on the
icosahedron (there are 16 × 20 = 320 faces).

In the case of a geodesic dome, it is possible to
proceed hierarchically, particularly if the frequency is a
power of two. The geodesic dome is based on some regular
polyhedron. The appropriate facet of this polyhedron is
found as above. Next, one determines into which of the
triangles of the first division of this facet the given unit
normal falls. This can be done by considering which
dot-product has the second largest value. No new dot-products
need to be computed. The process is then repeated with
the four triangles into which this facet is divided, and so
on. In practice, lookup table methods can be used, which,
while not exact, are very quick.

Let the area occupied by one of the cells on the
Gaussian sphere be $\omega$ (in the case of the icosahedron
$\omega = 4\pi/20$). The expected number of surface normals
mapped into a cell equals

$$\rho\,\omega\,|\bar{G}|,$$

for a convex object, where $\bar{G}$ is the average of $G(\xi, \eta)$ over
the cell.


It  is  clear  that  the  extended  Gaussian  image  can 
be  computed  locally.  One  simply  counts  the  number  of 
surface  normals  that  belong  in  each  cell.  The  expression  for 
the  Gaussian  curvature,  on  the  other  hand,  includes  first 
and  second  partial  derivatives  of  the  surface  function.  In 
practice, estimates of derivatives are unreliable, because of
noise. It is important therefore that the extended Gaussian
image can be computed without estimating the derivatives.

The values in the cells can be thought of as an
orientation histogram. It has recently been brought to my
attention that this is analogous to a scheme used for
histogramming directions of dendrites on neurons [34].

The  result  can  be  displayed  graphically  using  normal 
vectors  on  each  of  the  cells  to  represent  the  weight  of  the 
accumulated  surface  normals.  A  frequency  two  sub-division 
of  the  pentakis  dodecahedron  provides  240  cells,  enough  for 
most  practical  purposes  (Figure  19).  The  angular  spread  in 
this  case  is  about  11.5°.  An  alternative  way  to  present  the 
extended  Gaussian  image  graphically  is  by  means  of  a  grey- 
level  image  where  brightness  in  each  cell  is  proportional 
to  the  count.  The  surface  of  the  Gaussian  sphere  may  be 
projected  stereographically  instead  of  orthographically  in 
order  to  preserve  the  shapes  of  the  cells.  Their  areas  will 
be  scaled  unequally  however. 

A further refinement of the orientation histogram has
us  store  the  sum  of  the  vectors,  scaled  according  to  the 
area  of  the  corresponding  patch,  rather  than  just  the  sum 
of  the  areas  of  the  patches.  This  requires  three  times  as 
much  memory  space,  but  provides  more  accuracy.  In  fact, 
in  the  case  of  polyhedra,  this  representation  is  exact. 


Figure 19. Orientation histogram collected on a
geodesic dome derived from the pentakis dodecahedron
(there are 12 × 5 × 4 = 240 faces). This is a discrete
approximation of the extended Gaussian image. The length
of the vector attached to the center of a cell is proportional
to the number of surface normals on the surface of the
original object which have orientations falling within the
range of directions spanned by that cell.


6. Solids of Revolution

In  the  case  of  the  surface  of  a  solid  of  revolution, 
the  Gaussian  curvature  is  rather  easy  to  determine.  The 
solid  of  revolution  can  be  produced  by  rotating  a  (planar) 
generating  curve  about  an  axis  (Figure  20).  Let  the 
generating  curve  be  specified  by  the  perpendicular  distance 
from  the  axis,  r(s),  given  as  a  function  of  arc  length  s  along 
the curve. Let $\theta$ be the angle of rotation around the axis.
Now consider the Gaussian sphere positioned so that its
axis is aligned with the axis of the solid of revolution. Let
$\xi$ be the longitude and $\eta$ be the latitude on the Gaussian
sphere.

We can let $\xi$ correspond to $\theta$. That is, a point on the
object produced when the generating curve has rotated
through an angle $\theta$ has a surface normal that lies on the
Gaussian sphere at a point with longitude $\xi = \theta$.


Figure  20.  A  solid  of  revolution  can  be  generated  by 
rotating  a  curve  around  an  axis.  The  curve  can  be  specified 
by giving the distance from the axis as a function of the
arc  length  along  the  curve. 

6.1.  Gaussian  Curvature  of  Solid  of  Revolution 

Consider a small patch on the Gaussian sphere lying
between $\xi$ and $\xi + \delta\xi$ in longitude and between $\eta$ and
$\eta + \delta\eta$ in latitude. Its area is

$$\cos\eta\,\delta\xi\,\delta\eta.$$

We need only determine the area of the corresponding
patch on the object. It is

$$r\,\delta\theta\,\delta s,$$

where $\delta s$ is the change in arc distance along the generating
curve corresponding to the change $\delta\eta$ in surface orientation.
The Gaussian curvature is the limit of the ratio of the two
areas as they tend to zero. That is,

$$K = \lim_{\delta\theta,\,\delta s \to 0} \frac{\cos\eta\,\delta\xi\,\delta\eta}{r\,\delta\theta\,\delta s} = \lim_{\delta s \to 0} \frac{\cos\eta\,\delta\eta}{r\,\delta s} = \frac{\cos\eta}{r}\,\frac{d\eta}{ds},$$

since $\delta\xi = \delta\theta$. The curvature of the generating curve, $\kappa_G$,
is just the rate of change of direction with arc length along
it [22-24]. So

$$\kappa_G = \frac{d\eta}{ds},$$

and hence

$$K = \frac{\kappa_G \cos\eta}{r}.$$

It is easy to see that (Figure 21) $\sin\eta = -r_s$, where $r_s$ is
the partial derivative of $r$ with respect to $s$. Differentiating
with respect to $s$ we get

$$\cos\eta\,\frac{d\eta}{ds} = -r_{ss},$$

and so we obtain the simple formula

$$K = -\frac{r_{ss}}{r}.$$

In the case of a sphere of radius $R$, for example, we
have $r = R\cos(s/R)$ for $-(\pi/2)R < s < +(\pi/2)R$. Thus
$r_{ss} = -r/R^2$ and $K = 1/R^2$.
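
The relation $K = -r_{ss}/r$, which the sphere example illustrates, can also be checked numerically for any generating curve given as $r(s)$. The sketch below uses a second-order central difference for $r_{ss}$; the function name `gaussian_curvature`, the step size, and the test radius are assumptions chosen for the example.

```python
import math

def gaussian_curvature(r, s, h=1e-4):
    """Gaussian curvature K = -r_ss / r of a solid of revolution whose
    generating curve is r(s), with s the arc length along the curve."""
    r_ss = (r(s + h) - 2.0 * r(s) + r(s - h)) / (h * h)
    return -r_ss / r(s)

R = 2.0   # sphere radius, an assumed test value

def sphere_profile(s):
    """Distance from the axis for a sphere of radius R."""
    return R * math.cos(s / R)

k_sphere = gaussian_curvature(sphere_profile, 0.7)
print(k_sphere, 1.0 / R ** 2)
```

The numerical value should agree with $1/R^2$ to within the accuracy of the finite difference.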


Figure  21.  This  figure  shows  the  relationships  between 
infinitesimal  increments  in  arc  length  along  the  curve, 
distance  from  the  axis  of  rotation  and  distance  along  the 
axis  of  rotation. 


For some purposes it is more useful to express the
radius $r$ as a function of the distance along the axis, rather
than as a function of arc length along the curve. Let the
distance along the axis be denoted by $z$. It is easy to see
that (Figure 21) $\tan\eta = -r_z$, and so, differentiating with
respect to $s$,

$$\sec^2\eta\,\frac{d\eta}{ds} = \frac{d}{ds}(-r_z) = -r_{zz}\,\frac{dz}{ds},$$

where from the figure we see that $\cos\eta = dz/ds$, so that

$$\kappa_G \cos\eta = \frac{d\eta}{ds}\cos\eta = -r_{zz}\cos^4\eta,$$

and finally

$$K = -\frac{r_{zz}}{r\,(1 + r_z^2)^2},$$

since

$$\sec^2\eta = 1 + r_z^2.$$

6.2. Alternate Derivation of Gaussian Curvature of
a Solid of Revolution (*)

We first need to review Meusnier's theorem [22-24].
Consider a normal section of a surface at a particular
point. It is obtained by cutting the surface with one of
the planes including the local normal (Figure 22). Suppose
that the curvature of the curve in which the surface cuts
this plane is $\kappa_N$. Now imagine tilting the plane away from
the normal by an angle $\eta$ (using the local tangent as an
axis to rotate about). The new plane will cut the surface
in a curve with higher curvature. In fact, it can be shown
that the new curve has curvature

$$\kappa_N/\cos\eta.$$

It is easy to see this in the case of a sphere, since a
plane including the center cuts the sphere in a great circle,
while an inclined plane cuts it in a small circle of radius
proportional to the cosine of the angle of inclination.


Figure  22.  The  curvature  of  the  curve  obtained  by 
cutting  a  surface  using  an  inclined  plane  is  greater  than 
that  obtained  by  cutting  it  using  a  plane  which  includes 
the surface normal. Meusnier's theorem tells us that the
ratio  of  the  two  curvatures  is  equal  to  the  cosine  of  the 
angle  between  the  two  planes. 

Now,  let  us  return  to  the  surface  of  revolution.  It  is 
not  hard  to  show  that  one  of  the  principal  curvatures  at  a 
point  on  the  surface  will  correspond  to  a  cut  through  the 
surface  by  a  plane  which  includes  the  axis  of  revolution. 
The  curve  obtained  in  this  way  is  just  the  generating  curve 
of  the  solid  of  revolution.  So  one  of  the  two  principal 
curvatures is equal to the curvature $\kappa_G$ of the generating
curve  at  the  corresponding  point. 

Now consider a plane perpendicular to the axis of
revolution through the same surface point (Figure 23). It
cuts the surface in a circle. The curvature in this plane
equals $(1/r)$, where $r$ is the radius of the solid of revolution
at that point. This horizontal plane, however, is not a
normal section. Suppose that the normal makes an angle
$\eta$ relative to this plane (the local tangent plane also makes
an angle $\eta$ relative to the axis of revolution). Now construct
the plane including the local normal which intersects the
horizontal plane in a line perpendicular to the axis. This
plane will be inclined $\eta$ relative to the one we have just
studied. It also produces the second principal normal
section sought after. By Meusnier's theorem we see that
the curvature of the curve found in this normal section is
$\kappa_N = (1/r)\cos\eta$. Finally, the Gaussian curvature is found
by multiplication to be

$$K = \frac{\kappa_G \cos\eta}{r}.$$

Figure 23. If the solid of revolution is cut by a plane
perpendicular to the axis of rotation, a circle is obtained.
The curvature in this plane is just the inverse of the
distance of the surface from the axis. The curvature of
the corresponding normal section can be obtained using
Meusnier's theorem.

In the case of a sphere of radius $R$, for example, we
have $r = R\cos\eta$ and $\kappa_G = 1/R$, so that $K = 1/R^2$, as
expected.

To make this result more usable, erect a coordinate
system with the z-axis aligned with the axis of revolution.
The generating curve is given as $r(z)$. Let the first and
second derivatives of $r$ with respect to $z$ be denoted by
$r_z$ and $r_{zz}$ respectively. It is easy to see that (Figure 21)
$\tan\eta = -r_z$, so that

$$\cos\eta = \frac{1}{\sqrt{1 + r_z^2}}.$$

Furthermore,

$$\kappa_G = -\frac{r_{zz}}{(1 + r_z^2)^{3/2}},$$

so that finally,

$$K = -\frac{r_{zz}}{r\,(1 + r_z^2)^2}.$$

In order to use this result in deriving extended Gaussian
images it is necessary to identify points on the surface with
points on the Gaussian sphere. Suppose that we introduce
a polar angle $\theta$ such that

$$x = r\cos\theta \quad\text{and}\quad y = r\sin\theta.$$

Then a unit normal to the surface is given by

$$\frac{(\cos\theta,\ \sin\theta,\ -r_z)^T}{\sqrt{1 + r_z^2}}.$$

Equating this to the unit normal on the Gaussian sphere,

$$(\cos\xi\cos\eta,\ \sin\xi\cos\eta,\ \sin\eta)^T,$$

we get

$$\xi = \theta \quad\text{and}\quad \tan\eta = -r_z.$$


6.3.  Extended  Gaussian  Image  of  a  Torus 


As an illustration we will now determine the extended
Gaussian image of a torus. Let the torus have major axis
$R$ and minor axis $\rho$ (Figure 24). A point on the surface
can be identified by $\theta$ and $s$, where $\theta$ is the angle around
the axis of the torus, while $s$ is the arc length along the
surface measured from the plane of symmetry. Then,

$$r = R + \rho\cos(s/\rho),$$

and

$$r_{ss} = -(1/\rho)\cos(s/\rho),$$

so that

$$K = -\frac{r_{ss}}{r} = \frac{1}{\rho}\,\frac{\cos(s/\rho)}{R + \rho\cos(s/\rho)}.$$
Two points, $P$ and $P'$ (Figure 24), separated by $\pi$ in $\theta$,
have the same surface orientation on the torus. The surface
normal at one of these places points away from the axis
of rotation, while it points towards the axis at the other
place. Accordingly, two points on the object,

$$(\theta, s) = (\xi, \rho\eta) \quad\text{and}\quad (\theta, s) = (\xi + \pi, \rho(\pi - \eta)),$$

correspond to the point $(\xi, \eta)$ on the Gaussian sphere. The
curvatures at these two points have opposite signs,

$$K_+ = +\frac{1}{\rho}\,\frac{\cos\eta}{R + \rho\cos\eta} \quad\text{and}\quad K_- = -\frac{1}{\rho}\,\frac{\cos\eta}{R - \rho\cos\eta}.$$

The torus is not a convex object, so more than one point
on its surface contributes to a given point on the extended
Gaussian image. If we add the absolute values of the
inverses of the curvature we get

$$G(\xi, \eta) = \frac{1}{|K_+|} + \frac{1}{|K_-|} = 2R\rho\sec\eta.$$

If we had added the inverses algebraically instead we would
have obtained

$$\frac{1}{K_+} + \frac{1}{K_-} = 2\rho^2,$$

which is twice the result for a sphere of radius $\rho$.
The same results could have been found using

$$K = \frac{\kappa_G\cos\eta}{r},$$

since $\kappa_G = \pm 1/\rho$ and $r = R \pm \rho\cos\eta$, so that

$$K = \frac{\pm\cos\eta}{\rho\,(R \pm \rho\cos\eta)}.$$

The extended Gaussian image of a torus has singularities
at the poles. These correspond to the two rings on which
the torus would rest if it were dropped onto a plane. All
of the points on one of these rings have the same surface
orientation.

We can think of the Gaussian sphere as covered by
two sheets of a Riemann surface, one corresponding to
the inner half of the torus, closer to its axis of symmetry,
the other corresponding to the outer half. The two sheets
are connected to one another at the poles, branch points
corresponding to the two rings mentioned above. There
the Gaussian curvature changes sign.


Figure  24.  A  torus  obtained  by  spinning  a  circle  around 
an  axis.  The  resulting  object  is  not  convex.  Its  extended 
Gaussian image can be computed nevertheless.


We may also note at this point that all tori with
the same surface area, $4\pi^2\rho R$, have the same extended
Gaussian image.
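
The torus result can be verified numerically. The sketch below evaluates the curvatures at the outer and inner points that share an orientation, then forms both the sum of absolute inverses and the algebraic sum. The function names and the sample dimensions are assumptions for illustration, and the sign convention (positive curvature on the outer half) follows the derivation above.

```python
import math

def torus_curvatures(R, rho, eta):
    """Gaussian curvature at the outer and inner points of a torus
    (major radius R, minor radius rho) whose normal has latitude eta."""
    k_outer = math.cos(eta) / (rho * (R + rho * math.cos(eta)))
    k_inner = -math.cos(eta) / (rho * (R - rho * math.cos(eta)))
    return k_outer, k_inner

def torus_egi(R, rho, eta):
    """Extended Gaussian image: sum of absolute inverse curvatures."""
    k_outer, k_inner = torus_curvatures(R, rho, eta)
    return 1.0 / abs(k_outer) + 1.0 / abs(k_inner)

eta = 0.4
g_thin = torus_egi(3.0, 0.5, eta)   # R * rho = 1.5
g_fat = torus_egi(1.5, 1.0, eta)    # same product R * rho, same area
k_out, k_in = torus_curvatures(3.0, 0.5, eta)
algebraic_sum = 1.0 / k_out + 1.0 / k_in
print(g_thin, g_fat, algebraic_sum)
```

Two tori with the same product $R\rho$, and hence the same surface area, give the same value, while the algebraic sum depends only on $\rho$.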


6.4. The Unique Convex Object with $G(\xi, \eta) = 2\sec\eta$


While all tori with surface area $4\pi^2$ have the same
extended Gaussian image,

$$G(\xi, \eta) = 2\sec\eta,$$

there is only one convex object which has that extended
Gaussian image. It is a solid of revolution since $G(\xi, \eta)$ is
independent of $\xi$. So we have on the one hand

$$K = \tfrac{1}{2}\cos\eta,$$

and on the other hand

$$K = \frac{\kappa_G\cos\eta}{r},$$

so that

$$\kappa_G = r/2.$$

The equation states that the curvature of the generating
curve varies linearly with the distance from the axis of
rotation. This deceptively simple equation represents a
non-linear second order differential equation for $r$ in terms
of $z$ since

$$\kappa_G = -\frac{r_{zz}}{(1 + r_z^2)^{3/2}},$$

so that

$$r_{zz} = -\frac{r}{2}\,(1 + r_z^2)^{3/2}.$$

Now

$$\frac{d}{dz}\,\frac{1}{\sqrt{1 + r_z^2}} = -\frac{r_z\,r_{zz}}{(1 + r_z^2)^{3/2}}$$

and

$$\frac{d}{dz}\,\frac{r^2}{4} = \frac{r\,r_z}{2},$$

so that

$$\frac{d}{dz}\,\frac{1}{\sqrt{1 + r_z^2}} = \frac{d}{dz}\,\frac{r^2}{4},$$

or

$$\frac{1}{\sqrt{1 + r_z^2}} = \frac{r^2}{4} + c,$$

where $c$ is a constant of integration. We now have reduced
the problem to a non-linear first order differential equation
for $r$ in terms of $z$. If the object is to be convex and smooth
at its poles, we expect $r_z \to \infty$ as $r \to 0$. Thus $c = 0$. Next
note that the term on the left equals $\cos\eta$. So we also have

$$\cos\eta = \frac{r^2}{4},$$

or, using an earlier expression for $\kappa_G$,

$$\kappa_G = \sqrt{\cos\eta}.$$

This is an implicit equation for the curve of least energy
[35]. The curve of least energy is the curve which minimises
the integral of the square of the curvature $\kappa_G$. It can be
solved for $z$ in terms of $r$ to yield

$$z = \sqrt{2}\left[2E(\cos^{-1}(r/2),\ 1/\sqrt{2}) - F(\cos^{-1}(r/2),\ 1/\sqrt{2})\right],$$

where $E$ and $F$ are incomplete elliptic integrals. If we let
$s$ be the arc length along the curve we can also write the
solution in Whewell form

$$s = \sqrt{2}\,F(\cos^{-1}\sqrt{\cos\eta},\ 1/\sqrt{2}),$$

or Cesàro form

$$s = \sqrt{2}\,F(\cos^{-1}\kappa_G,\ 1/\sqrt{2}).$$

The  length  of  the  curve  from  the  pole  to  the  equator  is 
V2KH/V2)  =  v/2ff(sin(Tr/4))  = 

2v'2x 

where  K  is  the  complete  elliptic  integral  of  the  first  kind, 
and  T  is  the  gamma-function  (24j.  The  height  from  equator 

to  the  pole  is 


W  = 

while  the  maximum  radius  is 

II  =  2. 

The  minimum  radius  of  curvature  equals  one,  so  that  a 
circle  tangent  at  the  outermost  point  is  also  tangent  at  the 
origin  [35j.  This  circle,  when  rotated  about  the  vertical 
axis,  produces  a  torus  with  the  same  extended  Gaussian 
image  (Figure  25).  Both  objects  have  total  surface  area 
4t2. 
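
The closed-form values quoted above can be checked numerically. The sketch below assumes the generating curve satisfies $\cos\eta = r^2/4$ (the case c = 0) and uses the substitution $r^2 = 4\sin\phi$, under which the height becomes the integral of $\sqrt{\sin\phi}$ and the arc length the integral of $1/\sqrt{\sin\phi}$ over $[0, \pi/2]$. The helper `midpoint_integral` and the further substitution $\phi = t^2$ are choices made for this illustration only.

```python
import math

def midpoint_integral(f, a, b, n=20000):
    """Composite midpoint rule; never evaluates f at the endpoints."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# phi = t^2 removes the integrable singularity of 1/sqrt(sin(phi)) at phi = 0.
t_max = math.sqrt(math.pi / 2)

# Height from equator to pole: integral of sqrt(sin(phi)) d(phi).
height = midpoint_integral(lambda t: 2 * t * math.sqrt(math.sin(t * t)), 0.0, t_max)

# Arc length from pole to equator: integral of d(phi) / sqrt(sin(phi)).
length = midpoint_integral(lambda t: 2 * t / math.sqrt(math.sin(t * t)), 0.0, t_max)

# Surface area of both halves: 2 * integral of 2*pi*r ds, with r = 2*sqrt(sin(phi)).
area = 2 * midpoint_integral(
    lambda t: 2 * math.pi * (2 * math.sqrt(math.sin(t * t)))
    * (2 * t / math.sqrt(math.sin(t * t))),
    0.0, t_max)

gamma_quarter_sq = math.gamma(0.25) ** 2
height_closed = (2 * math.pi) ** 1.5 / gamma_quarter_sq
length_closed = gamma_quarter_sq / (2 * math.sqrt(2 * math.pi))
area_closed = 4 * math.pi ** 2
```

Comparing `height`, `length` and `area` with their `_closed` counterparts gives agreement to well below plotting accuracy, which is a useful sanity check on the reconstructed constants.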


Figure  25.  The  unique  convex  object  with  the  same 
extended  Gaussian  image  as  a  torus  has  an  interesting 
shape.  It  is  a  solid  of  revolution  whose  generating  curve 
is  the  curve  of  least  energy.  This  is  the  shape  which  a 
uniform  bar  constrained  to  pass  through  two  points  in 
space  with  given  orientation  will  adopt. 

7.  Gaussian Curvature in the General Case

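When the object is not a solid of revolution we need to
work a little harder to obtain the Gaussian curvature. Let
$x = x(u, v)$, $y = y(u, v)$, and $z = z(u, v)$ be parametric
equations for points on a given surface. Let $\mathbf{r} = (x, y, z)^T$
be a vector to a point on the surface. Then

$$\mathbf{r}_u = \frac{\partial \mathbf{r}}{\partial u} \quad\text{and}\quad \mathbf{r}_v = \frac{\partial \mathbf{r}}{\partial v}$$

are two tangents to the surface, as already noted earlier.
The cross product of these two vectors,

$$\mathbf{n} = \mathbf{r}_u \times \mathbf{r}_v,$$

will be perpendicular to the local tangent plane (Figure
14). The length of this normal vector squared equals

$$n^2 = \mathbf{n} \cdot \mathbf{n} = (\mathbf{r}_u \cdot \mathbf{r}_u)(\mathbf{r}_v \cdot \mathbf{r}_v) - (\mathbf{r}_u \cdot \mathbf{r}_v)^2$$

since $(\mathbf{a} \times \mathbf{b}) \cdot (\mathbf{c} \times \mathbf{d}) = (\mathbf{a} \cdot \mathbf{c})(\mathbf{b} \cdot \mathbf{d}) - (\mathbf{a} \cdot \mathbf{d})(\mathbf{b} \cdot \mathbf{c})$. A unit
vector $\hat{\mathbf{n}} = \mathbf{n}/n$ can be computed using this result.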

7.1.  Gaussian  Curvature  from  Variation  in  Normals 


The Gaussian curvature is the limit of the ratio of
the area of a patch on the Gaussian sphere to the area
of the corresponding patch on the surface, as the area
shrinks to zero. Consider an infinitesimal triangle formed
by the three points on the surface corresponding to $(u, v)$,
$(u + \delta u, v)$, and $(u, v + \delta v)$. The lengths of two sides of this
triangle are

$$|\mathbf{r}_u|\,\delta u \quad\text{and}\quad |\mathbf{r}_v|\,\delta v$$

while the sine of the angle between these sides equals

$$\frac{|\mathbf{r}_u \times \mathbf{r}_v|}{|\mathbf{r}_u|\,|\mathbf{r}_v|}$$

so that an outward normal with size equal to the area of
the triangle is given by

$$\frac{1}{2}(\mathbf{r}_u \times \mathbf{r}_v)\,\delta u\,\delta v = \frac{1}{2}\,\mathbf{n}\,\delta u\,\delta v.$$

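To determine the area of the corresponding triangular
patch on the Gaussian sphere we need to find the unit
surface normals at the three points. The unit surface
normals will be

$$\hat{\mathbf{n}},\quad \hat{\mathbf{n}} + \hat{\mathbf{n}}_u\,\delta u,\quad\text{and}\quad \hat{\mathbf{n}} + \hat{\mathbf{n}}_v\,\delta v$$

if we ignore terms of higher order in $\delta u$ and $\delta v$. Here $\hat{\mathbf{n}}_u$
and $\hat{\mathbf{n}}_v$ are the partial derivatives of $\hat{\mathbf{n}}$ with respect to
u and v. Note that $\hat{\mathbf{n}}_u$ and $\hat{\mathbf{n}}_v$ are perpendicular to $\hat{\mathbf{n}}$.
The area of the patch on the Gaussian sphere equals the
magnitude of

$$\frac{1}{2}(\hat{\mathbf{n}}_u \times \hat{\mathbf{n}}_v)\,\delta u\,\delta v$$

by reasoning similar to that used in determining the area
of the original patch on the given surface. We need to find
$\hat{\mathbf{n}}_u$ and $\hat{\mathbf{n}}_v$ to compute this area. Now

$$\hat{\mathbf{n}}_u = \frac{\partial}{\partial u}\,\frac{\mathbf{n}}{n} = \frac{n\,\mathbf{n}_u - n_u\,\mathbf{n}}{n^2}.$$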

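From $n^2 = \mathbf{n} \cdot \mathbf{n}$ we get

$$n\, n_u = \mathbf{n} \cdot \mathbf{n}_u$$

so that

$$\hat{\mathbf{n}}_u = \frac{(\mathbf{n} \cdot \mathbf{n})\,\mathbf{n}_u - (\mathbf{n} \cdot \mathbf{n}_u)\,\mathbf{n}}{n^3} = \frac{(\mathbf{n} \times \mathbf{n}_u) \times \mathbf{n}}{n^3},$$

$$\hat{\mathbf{n}}_v = \frac{(\mathbf{n} \cdot \mathbf{n})\,\mathbf{n}_v - (\mathbf{n} \cdot \mathbf{n}_v)\,\mathbf{n}}{n^3} = \frac{(\mathbf{n} \times \mathbf{n}_v) \times \mathbf{n}}{n^3},$$

since $(\mathbf{a} \times \mathbf{b}) \times \mathbf{c} = (\mathbf{a} \cdot \mathbf{c})\,\mathbf{b} - (\mathbf{b} \cdot \mathbf{c})\,\mathbf{a}$. Then

$$\hat{\mathbf{n}}_u \times \hat{\mathbf{n}}_v = \frac{1}{n^4}\left[(\mathbf{n} \cdot \mathbf{n})(\mathbf{n}_u \times \mathbf{n}_v) + (\mathbf{n} \cdot \mathbf{n}_u)(\mathbf{n}_v \times \mathbf{n}) + (\mathbf{n} \cdot \mathbf{n}_v)(\mathbf{n} \times \mathbf{n}_u)\right]$$

or

$$\hat{\mathbf{n}}_u \times \hat{\mathbf{n}}_v = \frac{1}{n^4}\,[\mathbf{n}\ \mathbf{n}_u\ \mathbf{n}_v]\,\mathbf{n}$$

since $[\mathbf{a}\,\mathbf{b}\,\mathbf{c}]\,\mathbf{p} = (\mathbf{a} \cdot \mathbf{p})(\mathbf{b} \times \mathbf{c}) + (\mathbf{b} \cdot \mathbf{p})(\mathbf{c} \times \mathbf{a}) + (\mathbf{c} \cdot \mathbf{p})(\mathbf{a} \times \mathbf{b})$,
where $[\mathbf{a}\,\mathbf{b}\,\mathbf{c}] = (\mathbf{a} \times \mathbf{b}) \cdot \mathbf{c}$.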

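This shows that the patch on the Gaussian sphere has
the same orientation as the patch on the surface, as it
should. An outward pointing normal of size equal to the
area is given by

$$\frac{1}{2n^4}\,[\mathbf{n}\ \mathbf{n}_u\ \mathbf{n}_v]\,\mathbf{n}\,\delta u\,\delta v.$$

The ratio of the two areas, the Gaussian curvature, is then
just

$$K = \frac{[\mathbf{n}\ \mathbf{n}_u\ \mathbf{n}_v]}{n^4}.$$

Now

$$\mathbf{n} = \mathbf{r}_u \times \mathbf{r}_v,$$

so that

$$\mathbf{n}_u = \mathbf{r}_{uu} \times \mathbf{r}_v + \mathbf{r}_u \times \mathbf{r}_{uv},$$

$$\mathbf{n}_v = \mathbf{r}_{uv} \times \mathbf{r}_v + \mathbf{r}_u \times \mathbf{r}_{vv}.$$

Using $(\mathbf{a} \times \mathbf{b}) \times (\mathbf{c} \times \mathbf{d}) = [\mathbf{a}\,\mathbf{b}\,\mathbf{d}]\,\mathbf{c} - [\mathbf{a}\,\mathbf{b}\,\mathbf{c}]\,\mathbf{d}$ or
$(\mathbf{a} \times \mathbf{b}) \times (\mathbf{c} \times \mathbf{d}) = [\mathbf{a}\,\mathbf{c}\,\mathbf{d}]\,\mathbf{b} - [\mathbf{b}\,\mathbf{c}\,\mathbf{d}]\,\mathbf{a}$ we get

$$\mathbf{n}_u \times \mathbf{n}_v = -[\mathbf{r}_{uu}\ \mathbf{r}_v\ \mathbf{r}_{uv}]\,\mathbf{r}_v + [\mathbf{r}_{uu}\ \mathbf{r}_v\ \mathbf{r}_{vv}]\,\mathbf{r}_u - [\mathbf{r}_{uu}\ \mathbf{r}_v\ \mathbf{r}_u]\,\mathbf{r}_{vv} + [\mathbf{r}_u\ \mathbf{r}_{uv}\ \mathbf{r}_v]\,\mathbf{r}_{uv} + [\mathbf{r}_u\ \mathbf{r}_{uv}\ \mathbf{r}_{vv}]\,\mathbf{r}_u$$

so that

$$[\mathbf{n}\ \mathbf{n}_u\ \mathbf{n}_v] = \mathbf{n} \cdot (\mathbf{n}_u \times \mathbf{n}_v) = [\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{uu}]\,[\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{vv}] - [\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{uv}]^2,$$

and finally

$$K = \frac{[\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{uu}]\,[\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{vv}] - [\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{uv}]^2}{|\mathbf{r}_u \times \mathbf{r}_v|^4}.$$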

This  result  can  be  used  to  derive  the  expression  for 
curvature  of  a  solid  of  revolution  in  a  more  rigorous 
fashion. 
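
As an illustration, the triple-product expression for K can be evaluated directly once the first and second partial derivatives of r are known. The sketch below applies it to a torus; the parametrization, the function names and the test radii are assumptions of this example, and the closed form $K = \cos v / (\rho (R + \rho \cos v))$ used for comparison is the standard result for a torus with major radius R and minor radius $\rho$.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def triple(a, b, c):
    """Scalar triple product [a b c] = (a x b) . c."""
    return dot(cross(a, b), c)

def gaussian_curvature(ru, rv, ruu, ruv, rvv):
    """K = ([ru rv ruu][ru rv rvv] - [ru rv ruv]^2) / |ru x rv|^4."""
    n = cross(ru, rv)
    numerator = triple(ru, rv, ruu) * triple(ru, rv, rvv) - triple(ru, rv, ruv) ** 2
    return numerator / dot(n, n) ** 2

def torus_derivatives(R, rho, u, v):
    """Partials of r(u, v) = ((R + rho cos v) cos u, (R + rho cos v) sin u, rho sin v)."""
    w = R + rho * math.cos(v)
    ru = (-w * math.sin(u), w * math.cos(u), 0.0)
    rv = (-rho * math.sin(v) * math.cos(u), -rho * math.sin(v) * math.sin(u), rho * math.cos(v))
    ruu = (-w * math.cos(u), -w * math.sin(u), 0.0)
    ruv = (rho * math.sin(v) * math.sin(u), -rho * math.sin(v) * math.cos(u), 0.0)
    rvv = (-rho * math.cos(v) * math.cos(u), -rho * math.cos(v) * math.sin(u), -rho * math.sin(v))
    return ru, rv, ruu, ruv, rvv

def torus_curvature_closed_form(R, rho, v):
    return math.cos(v) / (rho * (R + rho * math.cos(v)))
```

Because the formula contains only products of pairs of triple products, it does not depend on whether the normal $\mathbf{r}_u \times \mathbf{r}_v$ points inward or outward.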

7.2.  Fundamental  Forms  of  a  Surface  (*) 


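Let, as before,

$$\hat{\mathbf{n}} = \frac{\mathbf{r}_u \times \mathbf{r}_v}{|\mathbf{r}_u \times \mathbf{r}_v|}$$

be the unit surface normal vector. The first fundamental
form of a surface gives the square of the element of distance
as [24]

$$ds^2 = |d\mathbf{r}|^2 = E(u, v)\,du^2 + 2F(u, v)\,du\,dv + G(u, v)\,dv^2.$$

The second fundamental form of a surface gives the normal
curvature using the equation [24]

$$-d\mathbf{r} \cdot d\hat{\mathbf{n}} = L(u, v)\,du^2 + 2M(u, v)\,du\,dv + N(u, v)\,dv^2.$$

The coefficients can be expressed in terms of derivatives of
r as follows,

$$E = \mathbf{r}_u \cdot \mathbf{r}_u,\quad F = \mathbf{r}_u \cdot \mathbf{r}_v,\quad\text{and}\quad G = \mathbf{r}_v \cdot \mathbf{r}_v$$

and

$$L = -\mathbf{r}_u \cdot \hat{\mathbf{n}}_u,\quad M = -\mathbf{r}_u \cdot \hat{\mathbf{n}}_v = -\mathbf{r}_v \cdot \hat{\mathbf{n}}_u,\quad\text{and}\quad N = -\mathbf{r}_v \cdot \hat{\mathbf{n}}_v,$$

or

$$L = \frac{[\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{uu}]}{\sqrt{EG - F^2}},\quad M = \frac{[\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{uv}]}{\sqrt{EG - F^2}},\quad N = \frac{[\mathbf{r}_u\ \mathbf{r}_v\ \mathbf{r}_{vv}]}{\sqrt{EG - F^2}},$$

so that [24]

$$K = \frac{LN - M^2}{EG - F^2}.$$

Finally, if the surface is given as z(x, y), the above reduces
to the familiar

$$K = \frac{z_{xx} z_{yy} - z_{xy}^2}{(1 + z_x^2 + z_y^2)^2}.$$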
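
A short sketch of the z(x, y) form follows. It evaluates the formula on the upper cap of a sphere of radius R, where the Gaussian curvature should be $1/R^2$ everywhere; the function names and the choice of test surface are assumptions made for this illustration.

```python
import math

def gaussian_curvature_monge(zx, zy, zxx, zxy, zyy):
    """K = (z_xx z_yy - z_xy^2) / (1 + z_x^2 + z_y^2)^2 for a surface z(x, y)."""
    return (zxx * zyy - zxy ** 2) / (1.0 + zx ** 2 + zy ** 2) ** 2

def sphere_cap_derivatives(R, x, y):
    """Analytic partial derivatives of z = sqrt(R^2 - x^2 - y^2)."""
    z = math.sqrt(R * R - x * x - y * y)
    zx = -x / z
    zy = -y / z
    zxx = -(z * z + x * x) / z ** 3
    zxy = -x * y / z ** 3
    zyy = -(z * z + y * y) / z ** 3
    return zx, zy, zxx, zxy, zyy

# Curvature at an arbitrary point of a sphere of radius 2.
k_sphere = gaussian_curvature_monge(*sphere_cap_derivatives(2.0, 0.5, -0.8))
```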

7.3.  Application  of  the  General  Formula  to  the 
Ellipsoid  (*) 

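In the case of the ellipsoid we have, as discussed
before,

$$\mathbf{r} = (a \cos\theta \cos\phi,\ b \sin\theta \cos\phi,\ c \sin\phi)^T,$$

$$\mathbf{r}_\theta = (-a \sin\theta \cos\phi,\ b \cos\theta \cos\phi,\ 0)^T,$$

$$\mathbf{r}_\phi = (-a \cos\theta \sin\phi,\ -b \sin\theta \sin\phi,\ c \cos\phi)^T,$$

and

$$\mathbf{r}_{\theta\theta} = (-a \cos\theta \cos\phi,\ -b \sin\theta \cos\phi,\ 0)^T,$$

$$\mathbf{r}_{\theta\phi} = (a \sin\theta \sin\phi,\ -b \cos\theta \sin\phi,\ 0)^T,$$

$$\mathbf{r}_{\phi\phi} = (-a \cos\theta \cos\phi,\ -b \sin\theta \cos\phi,\ -c \sin\phi)^T.$$

A surface normal can be found by taking cross products,

$$\mathbf{n} = \mathbf{r}_\theta \times \mathbf{r}_\phi = (bc \cos\theta \cos\phi,\ ca \sin\theta \cos\phi,\ ab \sin\phi)^T \cos\phi,$$

and the coefficients of the first fundamental form are,

$$E = \mathbf{r}_\theta \cdot \mathbf{r}_\theta = (a^2 \sin^2\theta + b^2 \cos^2\theta) \cos^2\phi,$$

$$F = \mathbf{r}_\theta \cdot \mathbf{r}_\phi = (a^2 - b^2) \sin\theta \cos\theta \sin\phi \cos\phi,$$

$$G = \mathbf{r}_\phi \cdot \mathbf{r}_\phi = (a^2 \cos^2\theta + b^2 \sin^2\theta) \sin^2\phi + c^2 \cos^2\phi.$$

Hence

$$EG - F^2 = \left[(bc \cos\theta \cos\phi)^2 + (ca \sin\theta \cos\phi)^2 + (ab \sin\phi)^2\right] \cos^2\phi.$$

To compute the coefficients of the second fundamental
form we need,

$$[\mathbf{r}_\theta\ \mathbf{r}_\phi\ \mathbf{r}_{\theta\theta}] = \mathbf{n} \cdot \mathbf{r}_{\theta\theta} = -abc \cos^3\phi,$$

$$[\mathbf{r}_\theta\ \mathbf{r}_\phi\ \mathbf{r}_{\theta\phi}] = \mathbf{n} \cdot \mathbf{r}_{\theta\phi} = 0,$$

$$[\mathbf{r}_\theta\ \mathbf{r}_\phi\ \mathbf{r}_{\phi\phi}] = \mathbf{n} \cdot \mathbf{r}_{\phi\phi} = -abc \cos\phi,$$

and so

$$[\mathbf{r}_\theta\ \mathbf{r}_\phi\ \mathbf{r}_{\theta\theta}]\,[\mathbf{r}_\theta\ \mathbf{r}_\phi\ \mathbf{r}_{\phi\phi}] - [\mathbf{r}_\theta\ \mathbf{r}_\phi\ \mathbf{r}_{\theta\phi}]^2 = (abc \cos^2\phi)^2.$$

Finally then,

$$K = \frac{LN - M^2}{EG - F^2} = \left(\frac{abc}{(bc \cos\theta \cos\phi)^2 + (ca \sin\theta \cos\phi)^2 + (ab \sin\phi)^2}\right)^2.$$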

This  result  was  used  earlier  in  the  discussion  of  the  extended 
Gaussian  image  of  the  ellipsoid. 
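
The ellipsoid result can be cross-checked against the well-known Cartesian expression $K = 1 / (a^2 b^2 c^2 (x^2/a^4 + y^2/b^4 + z^2/c^4)^2)$. The following sketch does this at one sample point; the function names and the semi-axes chosen are assumptions of this example.

```python
import math

def ellipsoid_curvature(a, b, c, theta, phi):
    """Gaussian curvature in the (theta, phi) parametrization used above."""
    d = ((b * c * math.cos(theta) * math.cos(phi)) ** 2
         + (c * a * math.sin(theta) * math.cos(phi)) ** 2
         + (a * b * math.sin(phi)) ** 2)
    return (a * b * c / d) ** 2

def ellipsoid_curvature_cartesian(a, b, c, x, y, z):
    """Standard Cartesian form of the ellipsoid's Gaussian curvature."""
    s = x * x / a ** 4 + y * y / b ** 4 + z * z / c ** 4
    return 1.0 / ((a * b * c) ** 2 * s * s)

def ellipsoid_point(a, b, c, theta, phi):
    return (a * math.cos(theta) * math.cos(phi),
            b * math.sin(theta) * math.cos(phi),
            c * math.sin(phi))

a, b, c, theta, phi = 3.0, 2.0, 1.0, 0.7, 0.4
k_param = ellipsoid_curvature(a, b, c, theta, phi)
k_cart = ellipsoid_curvature_cartesian(a, b, c, *ellipsoid_point(a, b, c, theta, phi))
```

For a = b = c = R both expressions reduce to $1/R^2$, as expected for a sphere.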


8.  Summary and Conclusions

We have defined the extended Gaussian image, discussed
its properties and given examples. Methods for
determining the extended Gaussian images of polyhedra,
solids of revolution and smoothly curved objects in general
were shown. The orientation histogram, a discrete
approximation of the extended Gaussian image, was described
along with a variety of ways of tessellating the sphere.
Machine vision methods for obtaining the surface
orientation information required to build an orientation
histogram are discussed elsewhere [1, 3 8]. Extended Gaussian
images based on object models can be matched with those
derived from experimental data. The application of
extended Gaussian images to object recognition and, more
importantly, to finding the attitude in space of an object,
is discussed in a recent article [17].
Abstract

It has been known for some time that a bilaterally
symmetric figure in an arbitrarily oriented plane P,
viewed under orthography, yields a skew symmetric
figure whose axes of symmetry and skew constrain the
orientation of P to a one-parameter subspace of the
spherical space of orientation vectors.

In recent work aimed at finding skew and symmetry
axes of shapes, we used the measured moments of input
skewed figures to constrain their symmetry and skew
axes to a one-parameter subspace of the toroidal space of
axis angle pairs. This subspace may be searched for those
axis pairs yielding maximum symmetry in the original
figure.

Here we briefly review the original constraint and
then consider several symmetry evaluators for use in
axis-finding. The most reliable proves to be one based
on the mass in radial segments of the figure.

Key words: symmetry measure, skewed symmetry,
symmetry constraint, discrete image shape.

1.  Skewed Symmetry and The Axis Constraint

A symmetric shape on a plane P viewed
orthographically from an arbitrary viewpoint exhibits
skewed symmetry. In the image, let the angle of the axis
of symmetry from the horizontal be $\alpha$, the angle of the
axis of skew from the horizontal be $\gamma$. The skew of the
figure is $\gamma - \alpha = \beta$.

Any two of $\alpha$, $\beta$, $\gamma$ constrain the plane P to a
hyperbola of orientations in gradient space [Kanade,
1979; Kender, 1978; Stevens, 1979]. A method for
finding $\alpha$, $\beta$, and $\gamma$ from image input was given in
[Friedberg, 1984; Friedberg and Brown, 1984]. Briefly, $\alpha$
and $\gamma$ may be expressed in terms of the second moments
of the input shape. The resulting equations constrain $\alpha$
and $\gamma$ to a closed curve on the toroidal $(\alpha, \gamma)$ space which
is guaranteed to contain the correct $\alpha$ and $\gamma$ values, of
which there may be several. This locus must then be
searched for $(\alpha, \gamma)$ values that are "correct", i.e. that
explain the input shape as a symmetric original shape
viewed under orthography. For example, any image
triangle has three such $(\alpha, \gamma)$ pairs.

Here, as in previous work, we ignore the problem of
searching the locus; in fact, in our work we evaluate
all points on it. The main content of the following
sections is the various evaluators for $(\alpha, \gamma)$ pairs in the
search. Each of these evaluators may be simply regarded
as a measure of (non-skewed bilateral) symmetry. To
evaluate an $(\alpha, \gamma)$ pair, the skewed figure is (implicitly or
explicitly) "deprojected" through the $(\alpha, \gamma)$ skew, that is,
by applying the inverse skew, and the symmetry of the
resulting figure (across the x axis) is evaluated (Fig. 1).

Briefly, to recapitulate the mathematics of the
constraint on $(\alpha, \gamma)$, first note that the 3-D orthographic
projection is equivalent to a 2-D skew of original points
$x_0$ to yield points $x_i$:

$$x_i = x_0 S, \quad\text{where}\quad S = \begin{pmatrix} \cos\alpha & \sin\alpha \\ \cos\alpha \cot\beta - \sin\alpha & \sin\alpha \cot\beta + \cos\alpha \end{pmatrix}.$$

The original figure has a second-moment matrix

$$M = \begin{pmatrix} m_{20} & m_{11} \\ m_{11} & m_{02} \end{pmatrix}, \quad\text{in which}\quad m_{11} = 0.$$

The moment matrix for the skewed image shape is
$M' = S^T M S$, and this implies the constraint C:

$$\alpha = \tan^{-1}\left(\frac{m'_{11} \tan\gamma - m'_{02}}{m'_{20} \tan\gamma - m'_{11}}\right)$$

$$\gamma = \tan^{-1}\left(\frac{m'_{02} \cot\alpha - m'_{11}}{m'_{11} \cot\alpha - m'_{20}}\right)$$

where the $m'$'s are measured in the image. Constraint
C implies $\alpha$ and $\gamma$ lie on a 1-D locus in $(\alpha, \gamma)$ space.
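
To make constraint C concrete, the sketch below builds the skew matrix S for a chosen $(\alpha, \gamma)$, forms the image moments $M' = S^T M S$ from a symmetric original with $m_{11} = 0$, and then recovers each angle from the other. The function names and the sample moments are assumptions of this illustration; `math.atan` returns principal values, so the example keeps both angles in the first quadrant.

```python
import math

def skew_matrix(alpha, gamma):
    """Rows of S: images of the original x and y unit vectors (beta = gamma - alpha)."""
    beta = gamma - alpha
    cot_b = math.cos(beta) / math.sin(beta)
    return [[math.cos(alpha), math.sin(alpha)],
            [math.cos(alpha) * cot_b - math.sin(alpha),
             math.sin(alpha) * cot_b + math.cos(alpha)]]

def skewed_moments(m20, m02, alpha, gamma):
    """Return (m'20, m'11, m'02) of M' = S^T M S for M = diag(m20, m02)."""
    S = skew_matrix(alpha, gamma)
    d = (m20, m02)
    mp = [[sum(S[i][j] * d[i] * S[i][k] for i in range(2)) for k in range(2)]
          for j in range(2)]
    return mp[0][0], mp[0][1], mp[1][1]

def alpha_from_gamma(m20p, m11p, m02p, gamma):
    t = math.tan(gamma)
    return math.atan((m11p * t - m02p) / (m20p * t - m11p))

def gamma_from_alpha(m20p, m11p, m02p, alpha):
    c = 1.0 / math.tan(alpha)
    return math.atan((m02p * c - m11p) / (m11p * c - m20p))

image_moments = skewed_moments(2.0, 1.0, 0.3, 1.1)
recovered_alpha = alpha_from_gamma(*image_moments, 1.1)
recovered_gamma = gamma_from_alpha(*image_moments, 0.3)
```

Sweeping `gamma` over its range and calling `alpha_from_gamma` at each step traces out the 1-D locus of candidate pairs for a given set of image moments.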

2.  Practicalities 

The evaluator research is concerned with noisy digital
images, not continuous mathematics. Quantization and
noise have a considerable effect on mathematically
perfect evaluators. Further, the research has so far been
concerned with reliable symmetry detectors, not those that
mirror human performance. We find that evaluators based on
boundary features, while possibly relevant for humans,
do not, in the simple form we implemented, perform
reliably or predictably.


3.  Lateral  Symmetry  Evaluators 

All the measures discussed in this section require a
significant amount of computation to evaluate. This is
due to the need to invert the candidate $(\alpha, \gamma)$
transformation on a point by point basis. $E_{strict}$ (Section
3.1) is worst in this regard since it requires transforming
two points completely for each point in the figure. $E_{3m}$
(Section 3.3) is best since it requires determining only
the ordinate of each point in the figure. All the measures
in this group require processing an entire figure to judge
its degree of symmetry. This precludes time-saving
techniques like abandoning a candidate axis pair if the
evaluation passes some heuristic threshold. Experimental
results with these evaluators are given in Section 7.

3.1  Strict  Symmetry 

An obvious evaluator is based on the definition of
symmetry. Strict symmetry requires that $D(x_0, y_0) =
D(x_0, -y_0)$, where D is density, intensity, texture or some
other feature of interest. Let our measure of strict
symmetry be defined as:

$$E_{strict} = \sum |D(x_0, y_0) - D(x_0, -y_0)|$$

Deprojecting through the best $(\alpha, \gamma)$ minimizes $E_{strict}$.
Our definition of skewed symmetry for a figure is
equivalent to finding $E_{strict} = 0$ when measuring
original point locations in the coordinates implied by
some $(\alpha, \gamma)$ pair. Clearly this measure is perfect in the
continuous realm. Unfortunately, under the discrete
representation $E_{strict}$ performs poorly. The most
significant problem is the dislocation of fine detail. $E_{strict}$
is based on one-to-one point pairings and is very
sensitive to dislocation, especially of high-frequency
features of interest such as boundaries.

3.2  Product  of  Inertia 

More successful than the one-to-one point pairing are
some necessary but insufficient conditions for symmetry.
Our solution for the constraint C was based on the
product of inertia, moment $m_{11}$, being zero for
symmetric figures. Let us define the product of inertia
measure as:

$$E_{poi} = |m_{11}|$$

where $m_{11}$ is the moment measured in the original
figure, not the image. In practical terms, we first
deproject by applying the inverse of the candidate $(\alpha, \gamma)$
skewing and then measure $m_{11}$. Probably because of its
global nature, this measure is less sensitive to dislocation
than $E_{strict}$ (a property it shares with other moment
measures).

It appears $E_{poi}$ lacks discriminatory power due to the
nature of constraint C itself. Since we exploit the
property $E_{poi} = 0$ for symmetric figures in the solution
for C, the $(\alpha, \gamma)$ candidate pairs in the locus for a given
figure are just those that transform some symmetric
figure into one with the moments measured in the
image. The observed range of values for $E_{poi}$ is
unsatisfactorily small and close to the minimum value of
zero.

3.3  Third  Moment 

A measure of asymmetry (used in statistics to
measure skewness) is the third moment in y, $m_{03}$.
Define:

$$E_{3m} = |m_{03}|$$

Again, we measure this moment in the original
figure, not the image. The measure is minimized by a
symmetric figure.

This measure exhibits at least one major weakness:
$E_{3m}$ is minimized by antisymmetries as well as
symmetries. This is most noticeable when evaluating
parallelograms, which are judged to have an infinite
number of equally good $(\alpha, \gamma)$ when there are actually
only four. For many general figures, $E_{3m}$ has proven
quite effective in discriminating axes of skewed
symmetry.
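
The two moment measures can be sketched for a figure given as a list of centroid-relative image points. The code below is a minimal illustration, assuming unit mass per point and the skew convention in which the original x axis maps to direction $\alpha$ and the original y axis to direction $\gamma$ scaled by $1/\sin(\gamma - \alpha)$; the function names are hypothetical.

```python
import math

def skew(points, alpha, gamma):
    """Forward skew: original (x0, y0) to image (xi, yi)."""
    sin_beta = math.sin(gamma - alpha)
    return [(x0 * math.cos(alpha) + y0 * math.cos(gamma) / sin_beta,
             x0 * math.sin(alpha) + y0 * math.sin(gamma) / sin_beta)
            for x0, y0 in points]

def deproject(points, alpha, gamma):
    """Inverse skew: image (xi, yi) back to candidate original (x0, y0)."""
    sin_beta = math.sin(gamma - alpha)
    originals = []
    for xi, yi in points:
        # Abscissa along the symmetry axis; needs both alpha and gamma.
        x0 = (xi * math.sin(gamma) - yi * math.cos(gamma)) / sin_beta
        # Ordinate; depends on alpha only.
        y0 = -xi * math.sin(alpha) + yi * math.cos(alpha)
        originals.append((x0, y0))
    return originals

def e_poi(points):
    """Product of inertia measure |m11| with unit mass per point."""
    return abs(sum(x * y for x, y in points))

def e_3m(points):
    """Third moment measure |m03| with unit mass per point."""
    return abs(sum(y ** 3 for _, y in points))

symmetric = [(1.0, 0.5), (1.0, -0.5), (-2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (1.0, -2.0)]
image = skew(symmetric, 0.3, 1.1)
recovered = deproject(image, 0.3, 1.1)
```

Note that the ordinate computed in `deproject` uses only `alpha`, which is why the third moment measure needs less work per point than measures that require both coordinates.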

4.  Radial  Section  Symmetry 

This technique is based on the polar coordinates of
points in the figure rather than their Cartesian
coordinates. It has been the most effective evaluator in
our experiments (Section 7). Divide a figure into n
sectors by drawing n equally spaced rays with their
endpoints at the center of the figure, aligning the
positive ray of the x axis with one of these sector
boundary rays. Number the sectors 1 to n clockwise from
the sector whose counter-clockwise border is aligned
with the x axis.

Let the area of the figure falling within sector i be $A_i$.
For a symmetric figure $A_i$ and $A_{n-i+1}$ will be equal for
$1 \le i \le n$. Define the measure of sector symmetry as:

$$E_{ss} = \sum |A_i - A_{n-i+1}|, \quad 1 \le i \le \lceil n/2 \rceil.$$

$E_{ss}$ is minimized for a symmetric figure.

$A_i$ may be quickly calculated. Rather than transform
the points in the image figure and place them in the
untransformed sectors, we may place the untransformed
points in transformed versions of the sectors. We
exploit the fact that any angle $\theta_0$ (measured relative to
the x axis) in the original figure is skewed to an angle $\theta_i$
in the image figure by the relation:

$$\theta_i = \cot^{-1}(\cot\theta_0 + \cot(\gamma - \alpha)) + \alpha$$

The transformed sector boundaries can thus be easily
calculated. We only need to calculate the $\theta_i$ for each
point in the image figure once, since they are measurable
in the image and do not depend on the candidate skew
transformation.


A suggested implementation is to calculate a sorted
list of the $\theta_i$ once. For each candidate $(\alpha, \gamma)$ pair generate
the transformed sector boundaries. Find the sector in
which the least $\theta_i$ falls. Test each $\theta_i$ against the upper
boundary for the current sector. Increment the area for
the current sector if $\theta_i$ is less than the sector boundary.
Advance to the next sector if not.
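
The sector measure itself can be sketched as follows. This is a simpler variant than the sorted-list scheme suggested above: it assumes the points have already been expressed in candidate original coordinates relative to the figure's center, bins each point by its angle, and uses point counts as a stand-in for area. The function names are hypothetical.

```python
import math

def sector_areas(points, n):
    """Count points per sector; sector 0 starts at the +x axis and sectors run clockwise."""
    width = 2 * math.pi / n
    areas = [0] * n
    for x, y in points:
        clockwise = (-math.atan2(y, x)) % (2 * math.pi)
        k = min(int(clockwise / width), n - 1)  # guard against rounding up to n
        areas[k] += 1
    return areas

def sector_symmetry(points, n):
    """Sum of |A_i - A_(n-i+1)| over the first ceil(n/2) sectors (0-based internally)."""
    areas = sector_areas(points, n)
    half = (n + 1) // 2
    return sum(abs(areas[k] - areas[n - 1 - k]) for k in range(half))

symmetric = [(1.0, 0.5), (1.0, -0.5), (-2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (1.0, -2.0)]
e_ss_symmetric = sector_symmetry(symmetric, 8)
e_ss_lopsided = sector_symmetry(symmetric + [(1.0, 0.5)], 8)
```

Points lying exactly on the x axis fall on a sector boundary and are assigned to one side only, so a practical version would need a tie-breaking rule for them.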

Using 32 sectors, evaluations of $(\alpha, \gamma)$ pairs as close as
five degrees total difference in $\alpha$ and $\gamma$ vary by a factor
of eight or more when one of the pairs is a solution. The
result is a high degree of discrimination among $(\alpha, \gamma)$
transformations. Adaptations of $E_{ss}$ that substitute polar
moments over sectors rather than areas of sectors could
be developed. Since such moments would essentially
measure powers of the points' distance from the origin,
which is invariant under skew, they can be readily
calculated from measurements in the image. We have
not found the need for such adaptations, even for highly
distorted image figures or image figures derived from
unconnected originals.

5.  Compactness 

A widely-used shape number is compactness, defined
as area/perimeter$^2$. Define:

$$E_{com} = A / P^2$$

which takes on the maximum value of $1/4\pi$ for a circle.
Brady and Yuille use compactness to determine
three-dimensional orientation from two-dimensional contour.
They show that compactness is maximized if a skew
symmetric figure is "unskewed" to one of its original
symmetric shapes [Brady and Yuille, 1983]. For a
somewhat related measure, ellipticity, see [Proffitt, 1982].

We  have  implemented  a  version  of  the  measure  for 
our  discrete  representation.  We  found  it  lacking  in 
discriminatory  power  (Section  7).  Although  area  and 
perimeter  are  elegantly  calculated  in  the  world  of 
continuous  figures,  they  are  not  well  behaved  in  the 
discrete  realm.  Perimeter  may  be  especially  troublesome 
to  compute  accurately  in  comparison  with  other 
measures.  Rosenfeld  reports  experiments  where 
compactness  of  large  digitized  circles  was  computed 
using  two  different  measures  of  perimeter.  For  one 
measure,  the  discrete  compactness  was  maximized  at 
approximately  .98;  for  the  other  measure,  the  maximum 
was approximately .73 [Rosenfeld, 1974]. Compare these
figures with the expected figure of about .79. For
additional discussion of computing compactness over
discrete figures see [Sankar and Krishnamurthy, 1978]
and [Kulpa, 1979].

Our version of $E_{com}$ is based on the $A_i$ defined for
use with $E_{ss}$. Approximate the radius for sector i by
$r_i = \sqrt{A_i/\pi}$ and the perimeter for sector i by $P_i = 2\pi r_i / n$.
From this, approximate the total perimeter by
$P = \sum P_i + \sum |r_i - r_{i+1}|$, and calculate the ratio $A/P^2$.


The  standard  technique  of  calculating  area  and 
perimeter  length  concurrently  while  following  a  chain 
code  representation  of  the  figure's  perimeter  may  be 
adapted to the purpose [Freeman, 1974]. This technique
is a discrete analogue of the continuous integration for
area by Stokes' theorem. There are two ways to approach
an adaptation for figures under various $(\alpha, \gamma)$
transformations. The first is to take the chain code
derived  from  the  image  figure,  make  a  table  of 
transformations  from  angles  and  distances  in  the  image 
figure  to  angles  and  distances  in  the  candidate 
transformed  figure  and  use  these  transformed  measures 
in  the  usual  calculation.  The  second  is  to  transform  the 
figure,  define  the  new  perimeter  and  the  corresponding 
chain  code  and  proceed  as  usual.  The  first  approach 
introduces  less  quantization  noise  and  has  the  advantage 
of  being  much  cheaper  computationally. 

$E_{com}$ works best with connected figures but can be
used with unconnected figures once the separate
components have been identified. It has the
computational advantage that the amount of perimeter
information required can be substantially less than the
information required by other measures to evaluate
symmetry. This essential dependence on the perimeter
means initial determination of the perimeter is critical
and subject to some of the same resolution issues that
affect tangent and curvature computation.

$E_{com}$ inherently resolves shear ambiguity in favor of
the most compact shape. Other measures of symmetry
respond equally favorably to all shear-ambiguous
solutions. $E_{com}$ produces local maxima for the (multiple)
solutions, but it may not be possible to determine the
number of actual solutions, especially since some
solutions may be much less compact than others of equal
symmetry.

6.  Boundary  Features  and  Hough  Transform  Evaluators 

The  Hough  transform  is  a  technique  with  a  wide
range  of  applications.  Typically,  a  set  of  operators  is 
applied  uniformly  over  an  input  parameter  space.  The 
results  of  evaluation  at  each  point  in  the  image  space  are 
consistent  with  certain  values  in  the  space  of  parameters 
to  be  determined.  This  has  been  likened  to  "voting"  for 
the  appropriate  parameters.  In  this  way,  local 
information  in  the  image  space  is  transformed  to  a  global 
solution  in  the  parameter  space.  Experimental  results  for 
the  techniques  below  appear  in  Section  7. 

6.1  Boundary  Pair  Matching 

Take two points $(x_{1i}, y_{1i})$ and $(x_{2i}, y_{2i})$ on the
perimeter of a figure defining a chord across the figure
or a lobe of it. Each such pair of boundary points
contributes one vote in the parameter space, $(\alpha, \gamma)$:

$$\alpha = \tan^{-1}\left(\frac{y_{1i} + y_{2i}}{x_{1i} + x_{2i}}\right)$$

$$\gamma = \tan^{-1}\left(\frac{y_{1i} - y_{2i}}{x_{1i} - x_{2i}}\right)$$

That is, make the assumption that this chord is
parallel to the axis of skew and the midpoint of the
chord lies on the axis of symmetry. Symmetric figures
maximize the contributions for $(\alpha, \gamma)$ pairs corresponding
to axes of skewed symmetry.
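
The voting step can be sketched as below for boundary points given relative to the centroid. The accumulator layout, the 5 degree bin width and the function names are assumptions of this illustration; `math.atan2` is used in place of a plain arctangent so that vertical chords do not divide by zero, and both angles are reduced modulo $\pi$ since an axis has no direction.

```python
import math
from itertools import combinations

def chord_vote(p1, p2):
    """Return the (alpha, gamma) pair, each in [0, pi), implied by one chord."""
    (x1, y1), (x2, y2) = p1, p2
    alpha = math.atan2(y1 + y2, x1 + x2) % math.pi  # direction of the chord midpoint
    gamma = math.atan2(y1 - y2, x1 - x2) % math.pi  # direction of the chord itself
    return alpha, gamma

def accumulate_votes(boundary, bin_deg=5.0):
    """Accumulate one vote per pair of boundary points in a sparse (alpha, gamma) table."""
    votes = {}
    for p1, p2 in combinations(boundary, 2):
        if p1[0] + p2[0] == 0 and p1[1] + p2[1] == 0:
            continue  # midpoint at the centroid: alpha is undefined
        alpha, gamma = chord_vote(p1, p2)
        key = (int(math.degrees(alpha) // bin_deg), int(math.degrees(gamma) // bin_deg))
        votes[key] = votes.get(key, 0) + 1
    return votes
```

A dictionary keyed by bin indices keeps only the cells that actually receive votes, which matters because the number of cells grows with the square of the angular resolution.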

Why is this a plausible evaluation? We distinguish
illegal, legal and actual solutions. Illegal solutions are
$(\alpha, \gamma)$ pairs that do not fall on the locus of constraint C.
Legal solutions are pairs that do fall on the locus. Actual
solutions must be legal solutions and they correspond to
the actual skewed symmetries in the figure. If the two
points are, in fact, symmetrically corresponding points in
some legal solution, the chord does have the desired
properties. If the two points are not symmetric points in
a legal solution, the chord will contribute to an illegal
$(\alpha, \gamma)$ pair in the parameter space.

We now need to determine that the voting process
isolates the actual solutions among the legal solutions.
Let the length of the projection of a figure onto the
symmetry axis for some $(\alpha, \gamma)$ pair be L. Assume the
figure is convex and connected. If $(\alpha, \gamma)$ is an actual
solution, the evaluation of that pair will theoretically
receive L votes. This corresponds to the number of
chords centered on the axis of symmetry and parallel to
the axis of skew that can be drawn across the figure. A
non-solution will receive fewer than L votes, else it would
be an actual solution.

Robust behavior is generally provided by the (sparse)
distribution of votes over the parameter space, which
tends to avoid clustering around legal but non-actual
solutions. If we measure the "projection" of a non-convex
perimeter onto the symmetry axis as
$L = \sum \Delta s\, \mathrm{Pos}(\Delta x / \Delta s)$, where s is arc length along the
perimeter and $\mathrm{Pos}(x) = x$ if $x > 0$, otherwise $\mathrm{Pos}(x) = 0$, then
again an actual solution theoretically receives L votes.

So  long  as  we  compensate  for  changing  L  with 
changing  $(\alpha, \gamma)$,  actual  solutions  can  be  discriminated
from  non-solutions.  In  the  absence  of  such 
compensation,  this  Hough  transform  generally 
demonstrates  an  acceptable  degree  of  discrimination 
between  solutions  and  "close"  non-solutions  and  a 
moderately  better  degree  of  discrimination  between 
solutions  and  "background."  The  need  to  compensate 
evaluations  differently  for  each  candidate  transformation 
poses  a  practical  problem  for  Hough  transform  methods. 

6.2  Tangent  Pair  Matching 

A variant of the boundary pair matching idea uses
the following observation. Let $t_{10}$ and $t_{20}$ be tangents at
points $[x_{10}, y_{10}]$ and $[x_{10}, -y_{10}]$ in a symmetric figure.
Then the intersection of $t_{10}$ and $t_{20}$:

1) lies on the axis of symmetry, and

2) is invariant under skew (but not rotation).

Let $\gamma$ be calculated as before and let

$$\Delta x = x_{2i} - x_{1i}$$

$$\Delta y = y_{2i} - y_{1i}$$

$$\alpha = \tan^{-1}\left(\frac{\tan\theta_{1i}\, y_{2i} - \tan\theta_{2i}\, y_{1i} - \tan\theta_{1i} \tan\theta_{2i}\, \Delta x}{\Delta y - \tan\theta_{2i}\, x_{2i} + \tan\theta_{1i}\, x_{1i}}\right)$$

where $\theta_{1i}$ and $\theta_{2i}$ are the angles of the tangents at $(x_{1i}, y_{1i})$
and $(x_{2i}, y_{2i})$ respectively. The value for $\alpha$ is
determined by the angle between the intersection of the
tangents and the centroid.
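
The expression for $\alpha$ is the direction, seen from the centroid at the origin, of the point where the two tangent lines meet. A minimal sketch, with hypothetical function names and assuming neither tangent is vertical and the two tangents are not parallel:

```python
import math

def tangent_intersection(p1, theta1, p2, theta2):
    """Intersection of the lines through p1 and p2 with tangent angles theta1 and theta2."""
    (x1, y1), (x2, y2) = p1, p2
    t1, t2 = math.tan(theta1), math.tan(theta2)
    dx, dy = x2 - x1, y2 - y1
    x = (dy - t2 * x2 + t1 * x1) / (t1 - t2)
    y = (t1 * y2 - t2 * y1 - t1 * t2 * dx) / (t1 - t2)
    return x, y

def alpha_from_tangents(p1, theta1, p2, theta2):
    """Angle of the line from the centroid (origin) to the tangent intersection."""
    x, y = tangent_intersection(p1, theta1, p2, theta2)
    return math.atan(y / x)

# Two points on a unit circle, mirrored about an axis at angle 0.4 radians.
axis, half_angle = 0.4, 0.5
pa = (math.cos(axis + half_angle), math.sin(axis + half_angle))
pb = (math.cos(axis - half_angle), math.sin(axis - half_angle))
alpha_circle = alpha_from_tangents(pa, axis + half_angle + math.pi / 2,
                                   pb, axis - half_angle + math.pi / 2)
```

The common factor $1/(\tan\theta_{1i} - \tan\theta_{2i})$ cancels in the ratio y/x, which is why it does not appear in the formula for $\alpha$.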

6.3  Strict  Lateral  Hough 

This is an adaptation of strict lateral symmetry. Each
pair of points in the figure, not just the points on the
perimeter, contributes to the overall result. Otherwise, $\alpha$
and $\gamma$ are calculated in the same fashion as for boundary
pair matching.

6.4  Discussion 

These  techniques  are  significantly  more  expensive 
than  the  others  discussed.  The  two  perimeter-based 
versions  run  in  time  proportional  to  the  square  of  the 
length  of  the  perimeter,  rather  worse  than  the  other 
perimeter-based  techniques.  The  strict  lateral  technique 
runs  in  time  proportional  to  the  square  of  the  area,  again 
worse than the other area-based techniques.

An additional disadvantage is the need to accumulate
partial results over the entire $(\alpha, \gamma)$ parameter space.
Doubling the angular resolution of axis solutions
quadruples the size of the parameter space. It is possible
in some sense to invert the techniques used here by
evaluating  transformed  figures  for  orthogonal  axes  rather 
than  directly  evaluating  image  figures  for  axes  of  skewed 
symmetry,  but  this  removes  one  of  the  potential  benefits 
of  using  a  Hough  transform.  Fortunately,  the  constraint 
C  allows  us  to  discard  any  results  that  do  not  contribute 
to  legal  solutions.  Doing  so  saves  considerable  space  at 
the  expense  of  minor  additional  complexity. 

Both  perimeter  based  techniques  have  a  strong  bias 
for  axes  of  symmetry  that  align  with  parallel  lines  in  the 
figure  or  with  the  axis  of  elongation.  The  preference  for 
parallel  lines  is  due  to  the  heavy  contributions  to 
precisely  the  same  $\alpha$.  The  preference  for  the  axis  of
elongation  is  due  to  the  relationship  between  the 
projection  onto  the  symmetry  axis  and  the  number  of 
votes  that  can  be  accumulated,  as  noted  above.  This  is 
most  noticeable  when  evaluating  irregular 
parallelograms.  One  symmetry  axis  is  clearly  preferred 
over  the  other  three  solutions. 

The  tangent  pair  matching  technique  demonstrates 
one  advantage  over  the  boundary  pair  matching 
technique.  The  undesirable  preference  for  parallel  lines 
can  be  adjusted  by  weighting  contributions  by  some 
function  of  the  difference  between  the  two  tangents.  It 
has  the  disadvantages  of  additional  computational  cost 
and  a  requirement  for  higher  resolution  representation 
for  equivalent  accuracy  results. 


Hough  techniques  have  two  major  redeeming  values 
in  this  application.  First,  the  perimeter  techniques  are 
robust  enough  to  deal  with  obscuration  since  no  portion 
of  the  perimeter  is  critical  to  the  overall  result.  Second, 
the  techniques  evaluate  all  candidate  $(\alpha, \gamma)$  pairs  in  one
pass  over  the  image  rather  than  requiring  a  separate 
evaluation  of  each  candidate  pair.  When  the  angular 
resolution  desired  is  high,  this  may  swing  the  overall 
running  time  for  evaluation  of  skew  symmetry  to  favor 
the  Hough  transform.  Clearly  we  lose  most  of  this 
feature  if  we  are  forced  to  process  each  possible 
transformation  separately.  An  additional  potential  of 
these  evaluators  over  others  is  the  inherent  parallelism  of 
Hough  transforms. 

7.  Experiments 

Evaluation procedures were implemented for $E_{strict}$,
$E_{poi}$, $E_{3m}$, $E_{ss}$, $E_{com}$, and the three Hough transform
techniques.  The  performance  of  each  implemented 
evaluator  was  tested  on  at  least  10  different  figures  in 
judging  (subjectively)  their  relative  merits. 

Calculation  of  the  locus  of  constraint  C  was 
performed  once,  and  as  many  as  four  evaluators  could 
be  invoked  on  each  candidate  transformation.  The 
Hough  transforms  were  implemented  separately,  and  the 
constraint  C  was  used  to  "filter"  the  results  of  applying 
each  transform.  Graphic  displays  of  the  locus  of 
constraint  C  and  intensity  plots  of  the  evaluators  were 
implemented. 

Figure  2  serves  as  a  guide  to  Figures  3,  4,  and  5, 
which  show  the  eight  evaluators  applied  to  three  input 
shapes. 

8.  Conclusions  and  Future  Work 

We  present  and  test  eight  evaluators  of  symmetry, 
some  from  the  literature  and  some  new.  The  sector 
symmetry  evaluator  provides  the  best  combination  of 
accuracy  and  speed.  This  paper  is  a  condensation  of  a 
technical report (Friedberg, 1984), in which eleven
symmetry  evaluators  are  presented  and  analysed  at 
greater  length. 

Analysis  of  the  sensitivity  of  various  evaluators  to 
different  kinds  and  degrees  of  asymmetry  can  provide  a 
more objective method of choosing among the (many)
alternatives.  The  evaluators  discussed  here  and  others 
encountered  in  the  literature  should  prove  amenable  to 
such  analysis.  Particularly  interesting  forms  of 
asymmetry  are  those  introduced  when  processing  real 
images,  of  course. 

If there are constraints on the original shape of the
figure being processed (e.g., figures are known to be
portions of automotive connecting rods) or if there are
constraints on the axes of skewed symmetry (e.g., in
aerial photography with known relation between camera
and ground coordinates), additional analytic constraints
may be developed.


Skew  transformations  significantly  alter  some  forms 
of  texture.  Consider  radial  lines  as  a  texture  pattern.  In 
an  unskewed  figure,  these  lines  are  of  uniform  density 
around  the  figure.  In  a  skewed  figure,  the  density  of 
these  lines  varies  with  position  and  the  axes  of  skewed 
symmetry  can,  in  fact,  be  recovered  from  the  variations 
in  texture  density.  A  generalization  of  this  behavior  and 
its  exploitation  would  provide  a  useful  tool  in  processing 
obscured  images  or  in  further  constraining  the  axes  of 
skewed  symmetry  for  visible  ones. 

If  we  adopt  an  imaging  model  with  perspective 
projection,  we  can  no  longer  use  constraint  C  on  shear 
ambiguity  or  Kanade's  constraint  on  gradient  ambiguity. 
Many  tasks  in  robotics  and  "near"  vision  must  process 
images  produced  under  perspective  projection. 
Generalizing  either  constraint  would  be  useful. 

Finally,  we  make  no  claims  that  the  symmetry 
evaluators  presented  here  have  any  relation  to  analogous 
processes  in  the  human  visual  system  [Friedberg  and 
Brown, 1984; Schaefer, 1984].

Figure 1. An input image (upper left), the locus of
(a.y) weighted in intensity by the evaluator output
(upper right), the evaluator output displayed as ordinate
with a as abscissa (lower right), and the symmetric result
of deprojecting the input figure through the (a.y) skew
corresponding to one of the five maxima in evaluator
output (lower left).

Figure 2. Guide to Figures 3, 4, 5. In each case the
evaluator outputs are highlighted at the correct (a.y).

Figure 3: The radial sector evaluator performs best
on the parallelogram, with the strict lateral evaluator
finding the rectangular solutions and the product of
inertia finding the rhomboidal solutions. Of special
interest is the third moment, which is uniformly good
since it is maximized under both symmetry and
antisymmetry. The Hough Transform measures do not
perform well, with the boundary location scheme
working best.


Figure  4:  For  a  complex  "thunderbird"  shape,  the 
radial  sector  symmetry  again  performs  best,  with  all 
other  measures  performing  disastrously  except  for 
compactness,  which  finds  two  solutions. 




Figure  5:  A  "shape"  made  by  reflecting  a  random 
cloud  of  points  confuses  all  strict  symmetry  and  feature- 
based  schemes.  Again,  the  lateral  sector  symmetry 
performs  best.  Moment  evaluators  respond  strongly  to 
some solutions.

Abstract

The first part of this paper describes algorithms and
complexity arguments for a near optimal hidden surface
scheme. It starts with a viewpoint and a scene composed of
unions, intersections, and negatives of parametrised
volumes. Then limbs are found on the surfaces to a given
accuracy, and mutually consistent T-junctions and cusps are
found in their projections to the image plane. The surfaces
and image planes are represented by quad-trees. From these,
the topology of the projection is extracted, represented by
a "tooth-pick" structure which orders the surface regions
that project into the same area of the image. The foremost
regions can be displayed for hidden surface graphics. The
projection topology is invariant over a range of viewpoints
and models. The second part of the paper examines the
projection topology for changing viewpoint and surface
shape, and discusses how it might be predicted and
represented.

Introduction. 

There  are  two  parts  to  this  paper.  The  first  describes  an 
optimal  (and  mainly  implemented)  scheme  for  a  variation 
of  the  hidden  surface  problem.  This  is  graphics  whose  goal 
is  not  just  to  display  the  image,  but  to  know  qualitatively 
what  it  is  displaying.  Its  output  represents  the  topology  of 
the  projection  mapping  in  a  “tooth-pick”  graph  structure, 
which aligns regions of surface and image, and at whose
nodes the quantitative details are stored.

Graphics can be thought of as prediction with fixed
viewpoint and surface shape. The projection topology
structure is invariant to small changes in viewpoint and
surface shape. The second part of the paper discusses
general prediction by considering how the projection
topology can change as viewpoint and surface shape vary.

So three different mappings are analysed, with decreasing
success:

P1 : surface → image,
P2 : viewpoint → topology-of-P1,
P3 : surface-shape → topology-of-P2.


The  method  of  attack  in  all  cases  is, 

1. What are the singularities, where a small change in the
left  hand  side  produces  a  large  change  in  the  right  hand 
side?  How  can  they  be  found  and  represented? 

2.  How  can  the  singularities,  and  the  continuous  spaces 
between  them  be  organised  and  represented? 

3.  How  can  a  complete  description  of  the  map  be  extracted 
from  1  and  2? 

In the case of P1, the method is justified by complexity
arguments in section 2. The rest of the paper is organized
as follows:

(Introduction.) 

Part  One. 

1.  Geometry  of  the  projection  mapping.  Define  some  terms. 

2.  Complexity  arguments  to  derive  the  outline  of  the  hidden 
surface  algorithm. 

3.1  The  three  algorithm  steps. 

3.2  Supporting  algorithms. 

4.1 Modelling outline.

4.2  More  Modelling. 

Part  Two. 

5.0  Overview. 

5.1  Singularities  for  changing  view. 

5.2  Singularities  for  changing  surface  shape. 

6.  Prediction  Structures. 

Part  One. 

1. Geometry of the Projection Mapping.

Before describing the algorithms, a few geometrical terms
are defined.

A limb, also called a contour-generator, is a closed loop
of points on the surface being looked at, where the line of
sight is tangent to the surface, i.e., perpendicular to the
surface normal.


A  contour  is  the  projection  of  a  limb  to  the  image  plane. 

A T-junction is a point in the image where two contours
cross. There are two different limb points on the same line
of sight; the one closer to the viewpoint occludes the other.

The limbs divide the surface into forward and backward
facing regions compared to the eye, and are oriented so that
the forward face lies on the left (when looking down on the
surface from outside). Let F be the line of sight vector, and
L the limb tangent vector. Both F and L lie in the surface
tangent plane. If F × L is parallel to N (the outward surface
normal) then the forward face, on the left of L, is closer to
the eye than the right side, giving an outer limb point.

If F × L is anti-parallel to N, then the back face, on the
right, is closer, giving an inner limb point.

If F × L is zero there is a cusp point. F parallel to L means
that the tangent plane intersects the surface along F, which
is the definition of an asymptotic direction in the surface.

So L is parallel to an asymptotic direction in a hyperbolic
(saddle shaped) region of surface. This is another way
of defining a cusp. There are two types of cusp, with L
pointing towards or away from the eye.
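The three cases above can be illustrated with a short sketch. It assumes F, L and N are given as 3-tuples, and the function name `classify_limb_point` and the tolerance `tol` are hypothetical choices made here, not part of the paper.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify_limb_point(F, L, N, tol=1e-9):
    """Classify a limb point from the sign of (F x L) . N.

    F: line of sight vector, L: limb tangent, N: outward surface normal.
    The absolute tolerance tol is an assumption of this sketch.
    """
    s = dot(cross(F, L), N)
    if abs(s) <= tol:
        return "cusp"      # F x L vanishes: F is parallel to L
    return "outer" if s > 0 else "inner"


N = (1.0, 0.0, 0.0)                  # outward normal
F = (0.0, 0.0, 1.0)                  # line of sight, in the tangent plane
print(classify_limb_point(F, (0.0, -1.0, 0.0), N))
print(classify_limb_point(F, (0.0, 1.0, 0.0), N))
print(classify_limb_point(F, (0.0, 0.0, 2.0), N))
```

The same signed quantity is the triple scalar product used later to detect cusps along a limb.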

Fig.  1  shows  an  oblique  view  of  a  torus. 

Whitney  in  1955  proved  that  limbs  and  cusps  are  the 
only  possible  singularities  from  a  generic  viewpoint. 


2.  Hidden  Surface  Complexity  Arguments. 

These will show that an optimal algorithm must find the
singularities: limbs, Ts and cusps.

If e is the maximum allowed error in the image, l is the
total length of the limbs, and c the total curvature of the
limbs, then it has optimal complexity

O(n log n), where n is O(√(lc/e)).

There is a sequence of refinements to the following first
attempt algorithm.

1. Shoot out a conical bunch of rays from the eye.
Calculate where they intersect the surfaces of the
object being viewed, and order the intersections
along each ray by distance from the eye.

Algorithm 1 is inefficient because neighbouring rays (image
points) have the same ordering of surfaces along them,
except when an intervening ray goes through a limb, i.e., is
tangent to the surface there. So the order of surface regions
from the eye along an arbitrary ray can be recovered from
the ordering along the subset of limb rays.


2.  Find  the  intersections  of  limb  rays  (tangent  to 
some  surface)  with  other  surfaces  in  their  paths. 

This is inefficient because between neighbouring limb points
the surface orderings can only change in two ways: when
an intermediate contour ray passes through (i) a T-junction,
or (ii) a cusp.

3. First find the limb rays, then find the subset of T
junction and cusp rays and the intersections with
other surfaces there.

However, to find the complete surface orderings along rays
through cusp and T points, it is not necessary to solve
for the intersections and then order them. Instead, the
differences in surface orderings are already known:

(i) Over a limb, the forward and backward facing regions
are inserted together at one position in the ordering. For an
outer limb segment, the forward face comes before the back
face, while for an inner segment, the back face comes first.

(ii) Over a cusp, the forward and backward facing regions,
adjacent in the order, swap places.

(iii) Over a T, the two regions of the other T branch insert
themselves.

This set of differences can be solved to give complete
surface orderings for each image region, by propagating
partial orderings over the image.

4. Find the singular rays at limbs, T junctions and
cusps, then solve for the unique, complete surface
orderings in each image region that satisfy the
ordering differences at singular rays.
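As an illustration of the ordering differences (i) to (iii), the sketch below treats the surface ordering along a ray as a front-to-back list of region labels and applies one update per singular ray crossed. The list representation, the function names and the region labels are assumptions made for this example; the paper does not specify a data structure here. Only the direction of crossing that adds regions is shown.

```python
def cross_limb(order, pos, forward, back, outer=True):
    """(i) Insert a forward/back face pair together at index pos.

    For an outer limb segment the forward face comes first,
    for an inner segment the back face comes first.
    """
    pair = [forward, back] if outer else [back, forward]
    return order[:pos] + pair + order[pos:]


def cross_cusp(order, forward, back):
    """(ii) Swap a forward and back face that are adjacent in the ordering."""
    i, j = order.index(forward), order.index(back)
    if abs(i - j) != 1:
        raise ValueError("faces must be adjacent in the ordering")
    swapped = list(order)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def cross_t(order, pos, branch_regions):
    """(iii) Insert the two regions of the other T branch at index pos."""
    return order[:pos] + list(branch_regions) + order[pos:]


# Hypothetical scene: a sphere, then a plate whose contour crosses in front.
sphere = cross_limb([], 0, "sphere_front", "sphere_back", outer=True)
with_plate = cross_t(sphere, 0, ["plate_front", "plate_back"])
print(with_plate)
```

Crossing the same singular ray in the opposite direction would remove the inserted regions again, which is why the orderings of neighbouring image regions can be propagated rather than recomputed.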

This is the optimal hidden surface algorithm. Still
unspecified are the details of

(i) how the limbs, cusps and T junctions are to be found,

(ii) exactly what surface regions are being ordered (and
how to ensure that the Ts and cusps of each region, solved
locally to within an error bound, are globally consistent),

(iii) and how to propagate up the orderings.

These are described in the next section, but as far as
complexity is concerned, (ii) and (iii) are both
O(number of image areas between contours),
which is a measure of the scene complexity, and of smaller
cost than (i).

Let e be the maximum allowed error distance between
the approximate and exact contours in the image, l be the
total length of the limbs, c be the total curvature of the
limbs, and n be the number of limb rays solved for.

Then the average step length between adjacent rays is
l/n, the average step angle is c/n, and the interpolation
error is O(step-length × step-angle), so O(lc/n²) ≤ O(e),
and n must be Ω(√(lc/e)).

Meanwhile, the cost of solving iteratively for one limb
point is O(log(initial-error/final-error)). The final-error is
O(e), and the initial-error is O(step-length × step-angle),
which makes the cost O(log(lc/(en²))) = O(1).


This means that the cost of solving for all n contour
rays is O(n). The cost of finding each cusp is O(1); just
see if the limb to ray angle has changed sign. To find T
junctions, the n contour segments have to be tested for
intersection with each other, costing O(n log n). (Assuming
that no topological information is being used to direct the
search.)

Adding O(n) and O(n log n) results in the complexity
of an optimal hidden surface algorithm, O(n log n), where n
is O(√(lc/e)).

Actually the O(n) limb solving dominates.
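A small numerical sketch of the bound may help. It takes the interpolation error to be exactly (l/n) × (c/n), that is, it sets the constant hidden in the O() notation to 1, which is an assumption of this example only. The function names are hypothetical.

```python
import math


def interpolation_error(l, c, n):
    """Step length (l/n) times step angle (c/n), with unit constant."""
    return (l / n) * (c / n)


def limb_rays_needed(l, c, e):
    """Smallest n with l*c/n**2 <= e, i.e. n = ceil(sqrt(l*c/e))."""
    if l <= 0 or c <= 0 or e <= 0:
        raise ValueError("l, c and e must be positive")
    return math.ceil(math.sqrt(l * c / e))


# A circular limb of unit radius: length 2*pi, total curvature 2*pi.
l = c = 2.0 * math.pi
n = limb_rays_needed(l, c, 1e-3)
print(n, interpolation_error(l, c, n))
```

Because n grows only as the square root of 1/e, quartering the allowed error doubles the number of limb rays.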

3.1.  Three  Steps  of  Hidden  Surface  Algorithm. 

1. Split the whole surface up into maximal regions that
project (1-1) to the image, i.e., which do not twist
around to hide themselves. These are sub-regions of
the forward and back facing areas bounded by limbs.

2. Split the image plane up into areas that have a constant
ordering of the surface regions that project onto them.
Each area is bounded by a number of contour segments
with Ts or cusps at the corners.

3. In each image area work out what the ordering actually
is. There seems to be an optimal algorithm to do this
which requires only the information provided by Steps
1 and 2.

For hidden surface graphics including transparency, the
surface regions can be displayed in order, up to the first
opaque one for each area of the image.

Step  One. 

1. Step one first finds the limbs, which divide the surface up
into alternating forward and backward facing regions called
faces (w.r.t. the viewpoint). For planar surfaces tangent
to the line of sight, the limb is taken along the front edge,
so that the plane belongs to the back facing region. Each
face is bounded by a set of different limbs called a limb set.
Successive segments of the limbs are stored in quad-trees
representing the coordinate patches of the surface. Regions
of the quad trees between limbs are filled to extract the limb
sets.

2. The only way that a face can occlude itself is by having
T junctions or cusps in its limb set. If there are none, then
the face projects (1-1) to the image.

3. Ts and cusps mean that the face needs to be split up
further into (1-1) sub-regions. There seems to be a
decomposition algorithm that makes the splitting unique (it
pairs up Ts and cusps in the limb set), and at the same time
ensures that the combination of Ts and cusps is consistent
with a physically possible surface (approximating the exact
one). Having the Ts and cusps consistent for each face
separately is equivalent to the whole image being consistent.

The result is a graph structure where the nodes are the
(1-1) surface regions and the links are adjacency along a
limb, or projected limb indicated by the decomposition.

Step  Two. 

In order to search for T junctions in contour sets (the
projection of limb sets), a representation of the image has
already been built up in Step 1. It is a quad-tree of part
of the image plane. However, since it is not yet being used
for any distance measurements, it might be better to think
of it as a small sphere surrounding the viewpoint, that the
surfaces are being projected onto.

In this step T junctions between contours in different
sets are found. There are no global consistency conditions
for these Ts. The smallest undivided areas of the image are
extracted from the quad-tree, to build a graph of the image
where the nodes are areas and the links are the projected
boundaries of the (1-1) areas of Step one. The ends of the
links are cusps or T junctions.

Step  Three. 

The  last  step  quickly  redistributes  information  already 
contained  in  the  two  graphs  produced  by  steps  1  and  2,  to 
get  the  ordering  of  surfaces  in  every  image  area. 

It is important that the algorithm that does this can
be proved to work in all cases, and that the ordering of
surrounding contours is always resolved without having to
intersect a new ray with both surfaces. For two
disconnected objects that share no T junctions, exactly one
ray intersection is needed.

The graph of Step 2 says which (1-1) regions of surface
project over any image point. A sketch of the proof that they
can be ordered pairwise is given in 3.2.F. The details of the
ordering algorithm have not yet been worked out.

These 3 steps outline the hidden surface scheme. The
next sections give some more details. Some previous
versions of these algorithms that found limbs, Ts and cusps
without organizing the surface regions have been
implemented, e.g., the two views of a helix, figs. 7, 8.


3.2. Supporting Algorithms for Steps 1 and 2.

A.  Finding  and  following  the  limbs. 

B.  Searching  for  T  junctions  and  cusps. 

C. Decomposing into physically possible (1-1) subregions,
giving image consistency.

D.  Refining  T  junctions. 

E.  Image  accuracy. 

F.  Pairwise  Ordering. 

A.  Finding  the  Limbs. 

1.  Search  for  a  point  on  a  new  limb. 

This is not a time consuming step; the algorithm steps
along a fixed parameter path, and works out whether
successive surface normals point towards or away from the
eye. When they point oppositely, a limb passes between them.
Has that limb already been solved? If so continue
searching, otherwise give the two spanning points to the
following algorithm.

2. Follow the limb over the surface, point by point, until it
loops up with itself. This is the most expensive part of the
whole hidden surface scheme.

Each limb point is actually a pair of points which span
the limb closely. Having solved a stretch of limb, at the end
point we know the tangent direction and the previous step
angles (see fig. 2). So an initial guess to start a new
iteration for the next limb point can be made by
extrapolating the curve some distance. The distance is chosen
to minimize the number of iterations per step length. If the
step length is too long, the Newton-Raphson may not converge;
if too short, unnecessary extra points are found. It is chosen
to give the constant, optimal number of iterations (3 or 4)
that gets a bounding pair with angles between their surface
normals small enough to give an accurate estimate of limb
tangent for the next extrapolation.


In other words, step length d and step angle α along
the extrapolation curve are chosen so that d sin α is
constant, which makes the initial error and number of
iterations roughly constant.

If u and v are the two surface parameters, then the
tangent direction in parameter space is du = ∂(n·f)/∂v,
dv = −∂(n·f)/∂u. This and the previous limb point are
used to fit a curve to the limb in parameter space that
has linearly changing curvature. The extrapolation curve
assumes linearly changing curvature.
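The limb search and the tangent formula can be sketched together as follows. The example assumes a scalar function g(u, v) = n·f is available for the surface; here it is written out for a unit sphere viewed along the z axis under orthographic projection, which is an assumption of this sketch, as are the function names and the finite-difference step h.

```python
import math


def find_limb_bracket(g, path):
    """Step along a parameter path and return the first pair of successive
    points at which g = n.f changes sign, i.e. which span a limb."""
    prev = path[0]
    for cur in path[1:]:
        if g(*prev) * g(*cur) < 0:
            return prev, cur
        prev = cur
    return None


def limb_tangent(g, u, v, h=1e-6):
    """Tangent direction (du, dv) = (dg/dv, -dg/du) in parameter space,
    with the partial derivatives estimated by central differences."""
    dg_du = (g(u + h, v) - g(u - h, v)) / (2.0 * h)
    dg_dv = (g(u, v + h) - g(u, v - h)) / (2.0 * h)
    return dg_dv, -dg_du


def sphere_g(u, v):
    """n.f for a unit sphere with longitude u, latitude v, and line of
    sight f = (0, 0, 1): n = (cos v cos u, cos v sin u, sin v)."""
    return math.sin(v)


path = [(0.3, -0.5 + 0.2 * i) for i in range(6)]   # fixed u, increasing v
bracket = find_limb_bracket(sphere_g, path)
du, dv = limb_tangent(sphere_g, 0.3, 0.0)
print(bracket, du, dv)
```

For this sphere the limb is the equator v = 0, and the tangent comes out along the u direction, as expected. The bracketing pair would then be refined iteratively, which this sketch does not attempt.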

B. T-Junctions and Cusps.

To find cusps on limbs, check for a change in sign of the
triple scalar product,

Surface-normal · (Line-of-sight × Limb-tangent).

Finding T Junctions.

The image is represented by a quad tree, with clipping
usually done around the borders of the unit square. When
crossing contours are found in a quad square, they are
initially refined until the ordering at the T is clear. Both
ends of one segment must be closer to the eye than both ends
of the other.
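The ordering test at a T can be sketched as below. Each contour segment is assumed to be a pair of endpoints (x, y, depth), where depth is the distance from the eye; this representation and the function names are hypothetical. The sketch returns "refine" when two segments cross in the image but their depth ranges overlap, corresponding to the case where further refinement is needed before the ordering is clear.

```python
def orient(p, q, r):
    """Twice the signed area of triangle p, q, r in the image plane."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a0, a1, b0, b1):
    """True when the two image segments properly cross each other."""
    d1, d2 = orient(a0, a1, b0), orient(a0, a1, b1)
    d3, d4 = orient(b0, b1, a0), orient(b0, b1, a1)
    return d1 * d2 < 0 and d3 * d4 < 0


def t_ordering(seg_a, seg_b):
    """Return 'a' or 'b' for the occluding segment, 'refine' if the
    segments cross but the ordering is not yet clear, None if no T."""
    (a0, a1), (b0, b1) = seg_a, seg_b
    if not segments_cross(a0, a1, b0, b1):
        return None
    if max(a0[2], a1[2]) < min(b0[2], b1[2]):
        return "a"     # both ends of a are closer than both ends of b
    if max(b0[2], b1[2]) < min(a0[2], a1[2]):
        return "b"
    return "refine"


near = ((0.0, 0.0, 1.0), (1.0, 1.0, 1.2))
far = ((0.0, 1.0, 2.0), (1.0, 0.0, 2.5))
print(t_ordering(near, far))
```

This pairwise test is quadratic if applied to every pair of segments; the quad tree described in the text is what restricts the test to segments sharing a quad square.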

C.  Decomposing  the  Faces. 

The test to find cusps is local to a small patch of
surface, and the T test is local to an area of the image. It
is important that these local, approximate calculations can
be combined into a globally consistent image. Fig. 5 shows
two spurious T junctions. Fig. 6 shows the T junction of
a fishtail detected without its cusps. As the accuracy
constraints are relaxed, the resulting image should be the
exact projection of a 3D volume that approximates the real
one, degenerating into a sphere-like blob.

The decomposition algorithm has not been fully worked
out, but here is a sketch of how it should go for each face.

Start with the surface graph for that face, as if it
projected (1-1) with no Ts or cusps. This is a single node,
with a link for each limb in its limb set. Find the obvious
cusps. Use the image quad tree to detect Ts, and refine them
to the desired image accuracy, using the (implemented)
algorithm in the next section, D. If the desired accuracy is
smaller than the distance between the false Ts, this will
merge them away. Then link up the Ts and cusps around
sub-regions which split the face and its graph node.
Consistency conditions (paired Ts, cusps, etc.) are applied
around the borders of the subregions, and extra Ts and cusps
are added or removed to resolve inconsistencies. E.g., the two
undetected cusps of Fig. 6 would be added.

D.  Refining  T  junctions. 

1.  Two  contour  segments  intersect  each  other.  The  longer 
contour  is  bisected  (or  a  halfway  interpolation),  as  an  initial 
guess  to  iterate  to  a  new  limb  solution  point.  If  one  of  the 
two  new  halves  crosses  the  shorter  segment,  this  step  can  be 
repeated. 

2. But what if it does not? Then one end of the short
segment must be enclosed by the triangle formed by the long
segment and its two halves. The contour at the enclosed end
is extended until it leaves the triangle. If it crosses one of
the halves, 1. can be resumed. But if it crosses the long
segment, then it re-enters the stack of triangles formed by
previous successful bisections. The contour continues being
extended (step by step) through the stack, until it exits over
a bisected side, and can return to 1. If this never happens, it
finally cuts out of one of the original segments that formed
a T in the quad-tree. In that case two quad-tree Ts can
be merged away; they were just an artifact of approximate
solution methods.

E.  Guaranteeing  Image  Accuracy. 

If F1 and F2 are the two bounds, in the frame of the
eye, and are distance d apart, with θ the angle between their
surface normals N1 and N2, then sign(N1 · F1) is opposite
to sign(N2 · F2).

The maximum possible image error is (see fig. 3)
r k (1 − cos(θ/2)), where k is the ratio
image-plane-distance / min(|F1|, |F2|),
and r is d / (2 sin(θ/2)).

Requiring this error to be less than e gives

d k sin(θ/2) / (2 (1 + cos(θ/2))) < e,

which reduces to d tan(θ/4) < 2e/k.

This is the condition for stopping the iteration that refines
the bounds of a limb point.
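The stopping test can be checked numerically. The sketch below evaluates the maximum image error r k (1 − cos(θ/2)) with r = d / (2 sin(θ/2)), and the reduced form d tan(θ/4) < 2e/k, which follows from the identity (1 − cos x)/sin x = tan(x/2). The function names and the sample values of d, θ, k and e are hypothetical.

```python
import math


def max_image_error(d, theta, k):
    """Maximum image error for bounds a distance d apart whose surface
    normals differ by angle theta, with image scale ratio k."""
    if theta == 0.0:
        return 0.0                      # flat span: no sagitta
    r = d / (2.0 * math.sin(theta / 2.0))
    return r * k * (1.0 - math.cos(theta / 2.0))


def bounds_tight_enough(d, theta, k, e):
    """Reduced stopping condition d * tan(theta / 4) < 2 * e / k."""
    return d * math.tan(theta / 4.0) < 2.0 * e / k


d, theta, k = 0.1, 0.2, 2.0
print(max_image_error(d, theta, k))
print(bounds_tight_enough(d, theta, k, e=0.01))
```

The two forms agree: the error equals (d k / 2) tan(θ/4), so comparing it with e is the same as the reduced inequality.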

For a projected step of length l in the image, with
step angle α, the error in any interpolation scheme will
be O(l sin α). α is a measure of the inverse of the
radius-of-curvature (α is roughly l/radius-of-curvature), so
the error is O(l²/radius-of-curvature). There are two
alternatives. If the image interpolating error needs to be
bounded by fixed e, then the step length l between knot
points should be O(√(radius-of-curvature)). However, if the
image interpolation just needs to look good, then the error
should be proportional to step length. So l sin α ∝ l, giving
constant step angle, and step length O(radius-of-curvature).

F.  Pairwise  Ordering  for  Step  3. 

This is the step that combines the information in the
surface and image graphs, produced by Steps 1 and 2. Its
output is the complete surface orderings in each image
region, represented by lining up nodes of the surface and
image graphs.

For each image region, we can get the set of surface
regions that project over it (by propagating around the
inside of each contour set). They can be ordered pairwise,
hence completely.

Sketch  of  proof: 

Take a pair of (1-1) surface regions A and B that
intersect in the image. There are two ways that they can be
ordered immediately:

(i) if they have a T junction between the image projections
of their boundaries,

(ii) if they share a boundary, i.e., are adjacent on the
surface.

If neither of those works, find a connected path of
regions from A to B over the surface of the object, i.e.,
through the graph from Step 1.

Step down the path, starting at A, checking each region
against B by (i) and (ii), until one of them gives an ordering.
It is the same ordering as between A and B themselves. We
will eventually get an ordering because the last region in the
path borders on B, and (ii) applies.
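The path-stepping argument can be sketched as a small procedure. It assumes the surface graph from Step 1 is available as an adjacency dictionary, and that a caller-supplied function `direct(x, b)` returns an ordering when tests (i) or (ii) apply to the pair and None otherwise. Both the representation and the names are hypothetical, and the sketch follows the proof outline as stated rather than a worked-out algorithm.

```python
from collections import deque


def surface_path(graph, a, b):
    """Breadth-first path of regions from a to b through the surface graph."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def order_pair(a, b, graph, direct):
    """Step down the path from a towards b until some region on the path
    can be ordered against b directly; that ordering is taken for (a, b)."""
    path = surface_path(graph, a, b)
    if path is None:
        return None          # disconnected objects: a ray test is needed
    for region in path[:-1]:
        result = direct(region, b)
        if result is not None:
            return result
    return None


graph = {"A": ["C"], "C": ["A", "D"], "D": ["C", "B"], "B": ["D"]}
known = {("D", "B"): "in front of B"}
print(order_pair("A", "B", graph, lambda x, y: known.get((x, y))))
```

Returning None for regions with no connecting path reflects the remark in Step Three that two disconnected objects sharing no T junctions need one ray intersection.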
4.1 Modelling Outline, 3, 2, 1D.

3D  Modelling. 

Scene:  Objects  positioned  in  scene  frame. 

View:  Scene  transformed  into  eye  frame. 

Object: A connected object is given by:

Primitive volume, or
Deformed sub-object
(e.g. reflection, stretch, negative), or
union of sub-objects, or
intersection of sub-objects.

Primitive volume: Parametrized volume
that is piecewise continuous
and differentiable,
e.g. generalized cylinder,
half space of a plane.

This is theoretically equivalent to having the whole
scene surface parametrized in a number of coordinate
patches. Volume models are used because they are often
more compact and easier to describe.

Examples  of  deformations: 

One  wing  of  an  aircraft  can  be  modelled  as  the  reflection 
of  the  other.  A  tube  will  be  the  intersection  of  the  outer 
cylinder  with  the  negative  of  the  inner  one.  The  deformation 
of  objects  can  be  non-linear.  E.g.  looking  at  the  scene 
through  a  distorting  lens. 

E.g.  Definition  of  a  screw,  formed  by  cutting  out  the 
helical  threads. 

(Union (screw-head head-frame)
       (Intersection (screw-body body-frame)
                     (Negative (screw-threads-helix body-frame))))
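The same union, intersection and negative composition can be sketched as a small tree of point-membership tests. The class names mirror the operators in the text, but the classes themselves, the `contains` method, and the omission of coordinate frames are assumptions of this example. It builds the tube mentioned above (an outer cylinder intersected with the negative of an inner one) rather than the screw, since the screw's primitive shapes are not given.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Primitive:
    name: str
    inside: Callable[[Point], bool]   # membership test for the volume

    def contains(self, p: Point) -> bool:
        return self.inside(p)


@dataclass(frozen=True)
class Negative:
    child: object

    def contains(self, p: Point) -> bool:
        return not self.child.contains(p)


@dataclass(frozen=True)
class Union:
    left: object
    right: object

    def contains(self, p: Point) -> bool:
        return self.left.contains(p) or self.right.contains(p)


@dataclass(frozen=True)
class Intersection:
    left: object
    right: object

    def contains(self, p: Point) -> bool:
        return self.left.contains(p) and self.right.contains(p)


def z_cylinder(radius: float) -> Callable[[Point], bool]:
    """Infinite cylinder about the z axis (a hypothetical primitive)."""
    return lambda p: p[0] ** 2 + p[1] ** 2 <= radius ** 2


outer = Primitive("outer-cylinder", z_cylinder(2.0))
inner = Primitive("inner-cylinder", z_cylinder(1.0))
tube = Intersection(outer, Negative(inner))
print(tube.contains((1.5, 0.0, 0.0)), tube.contains((0.5, 0.0, 0.0)))
```

A fuller version would attach a coordinate frame to each node, as the Object-tree described later does.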

2D  Modelling. 

The image plane and each coordinate patch of the
surfaces have lines and regions represented by quad-trees.
The lines are stored with step length roughly proportional
to √(radius-of-curvature), to give constant accuracy, as
explained in section 3.2.E. The discrete version of this is

step-length ∝ √(d / sin(step-angle)), with a minimum length
equal to the allowed image error.

The quad squares are subdivided until the ends of a step are
in different nodes. Example of use: Intersections of lines
are found using the quad-trees.
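The subdivision rule can be illustrated by computing how deep a quad-tree over the unit square must go before the two ends of a step fall in different squares. This is a sketch under the assumption that squares are split into four equal quadrants at each level; the function names and the depth limit are hypothetical.

```python
def quad_cell(p, depth):
    """Integer (column, row) of the quad square containing point p
    at the given depth, for a quad-tree over the unit square."""
    n = 1 << depth                       # squares per side at this depth
    ix = min(int(p[0] * n), n - 1)       # clamp points on the far border
    iy = min(int(p[1] * n), n - 1)
    return ix, iy


def split_depth(p, q, max_depth=32):
    """Smallest depth at which the ends p and q of a step lie in
    different quad squares, or None if they never separate."""
    for depth in range(1, max_depth + 1):
        if quad_cell(p, depth) != quad_cell(q, depth):
            return depth
    return None


print(split_depth((0.1, 0.1), (0.9, 0.1)))
print(split_depth((0.1, 0.1), (0.2, 0.1)))
```

Short steps, which occur where the curvature is high, force deeper subdivision, so the tree is finest where the lines carry the most detail.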

1D Modelling.

Lines are represented by a 2-way list of knot points,
containing curvature information and intersections with
other lines.

Topological  Modelling. 

Projection  mapping  of  viewed  surfaces  to  image: 

The topology is represented by a "tooth-pick"
structure. A small number of tooth-picks, emanating from the
eye, spear through surface regions of the scene and into
regions of the image. They line up occluding patches of
surface bounded by limb segments and their projections. This
is invariant for a range of viewpoints and models.


4.2.  More  Modelling. 

A.  Preprocessing. 

B.  Primitive  Volume  Details. 

C.  Quad  Trees. 

Preprocessing  by  the  3D  Modelling  System. 

Given an object, scene, or view definition, perhaps with
a repeated sub-object deformed in different frames, the
modelling system constructs an Object-tree, with each node
describing a separate object instance, and the leaves being
the primitive volumes. The nodes provide a place to store
information about each object. For example, sequences of
linear transformations between coordinate frames are
combined. Another example: at the leaves of the Object-tree
for a view, the coordinate patches and limbs of the surface
are represented.

There are two more things that the modelling system
does before it can be used (e.g. by the graphics program).
First it finds the common intersection lines on the surfaces
of intersecting objects. At present this information is given
by hand. Then it works out which surface areas of the
primitive volumes of intersecting objects actually lie on the
surface of the composite object. It implements the formulas:

∂(a ∪ b) = ((∂a ∩ (Neg b)) ∪ (∂b ∩ (Neg a)))
∂(a ∩ b) = ((∂a ∩ b) ∪ (∂b ∩ a))

where Neg means negative volume, to decide which side of
each intersection line is on the outer surface. The
intersection lines are stored in the coordinate patch
quad-trees with the outer surface on the left.
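The two boundary formulas amount to a simple test on each point of a primitive's surface: a point of ∂a lies on the surface of the union when it is outside b, and on the surface of the intersection when it is inside b. The sketch below checks this for two overlapping discs in the plane; the disc primitives, the use of open interiors, and the function names are assumptions made for the example.

```python
import math


def disc(cx, cy, r):
    """Membership test for the open interior of a disc (hypothetical)."""
    def inside(p):
        return math.hypot(p[0] - cx, p[1] - cy) < r
    return inside


def on_union_surface(p_on_a_boundary, inside_b):
    """A point of the boundary of a is on the boundary of (a union b)
    when it lies in Neg b, i.e. outside b."""
    return not inside_b(p_on_a_boundary)


def on_intersection_surface(p_on_a_boundary, inside_b):
    """A point of the boundary of a is on the boundary of
    (a intersect b) when it lies inside b."""
    return inside_b(p_on_a_boundary)


inside_b = disc(1.0, 0.0, 1.0)          # b: unit disc centred at (1, 0)
left, right = (-1.0, 0.0), (1.0, 0.0)   # two points on the unit circle of a
print(on_union_surface(left, inside_b), on_intersection_surface(right, inside_b))
```

Points lying exactly on both boundaries are the intersection lines themselves, which the text treats separately.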

4.2. B.  Primitive  Volumes. 

The basic building blocks are simple parametrized
volumes. They have three parameters, (t,s,r), with each
volume point P = P(t,s,r) having unique parameters.
∂P/∂t, ∂P/∂s, ∂P/∂r form a right handed (not necessarily
orthogonal) set.

P is continuous in its parameters, and piecewise differentiable.
The surfaces of discontinuity (called jump-surfaces) must be
planar in parameter space. The purpose of this restriction is
that it makes it easy to interpolate where a line segment in
parameter space has crossed a jump-surface and to work out the
closest jump-surface. At present those surfaces must be parallel
to a parameter plane. To allow more general volume
discontinuities and surface edges, arbitrary planar jump-surfaces
could be implemented using a k-d tree structure.

The bounds of the volume are defined by restricting the
parameters t, s, and r to all lie in some volume of parameter
space, typically the unit cube. So there is a mapping (position
function P) from the unit cube of parameter space to real
coordinate space.


Opposite faces and edges of the cube can be identified with each
other, e.g. when one of the parameters is an angle, in
cylindrical or spherical coordinates.
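As an illustration of a primitive volume, the sketch below maps the unit parameter cube onto a solid cylinder and checks numerically that the three partial derivatives form a right-handed set. The function `cylinder`, its default `radius` and `height`, and the choice of t as the angle parameter are our own assumptions for the example, not the system's actual primitives.

```python
import math


def cylinder(t, s, r, radius=1.0, height=2.0):
    """Hypothetical position function P(t, s, r) on the unit cube:
    t is the angle fraction, s the height fraction, r the radial fraction."""
    angle = 2.0 * math.pi * t
    rho = radius * r
    return (rho * math.cos(angle), rho * math.sin(angle), height * s)


def partial(p, params, index, h=1e-6):
    """Central-difference estimate of dP/d(params[index])."""
    lo, hi = list(params), list(params)
    lo[index] -= h
    hi[index] += h
    return tuple((b - a) / (2.0 * h) for a, b in zip(p(*lo), p(*hi)))


def triple_product(u, v, w):
    """u . (v x w); positive when (u, v, w) is right-handed."""
    cross = (v[1] * w[2] - v[2] * w[1],
             v[2] * w[0] - v[0] * w[2],
             v[0] * w[1] - v[1] * w[0])
    return sum(a * b for a, b in zip(u, cross))


def is_right_handed(p, params):
    dt, ds, dr = (partial(p, params, i) for i in range(3))
    return triple_product(dt, ds, dr) > 0.0


print(is_right_handed(cylinder, (0.25, 0.5, 0.5)))
# The faces t = 0 and t = 1 map to the same points, so they are
# identified with each other and do not contribute to the surface.
print(cylinder(0.0, 0.3, 0.7), cylinder(1.0, 0.3, 0.7))
```

Swapping the roles of s and r in this parametrization reverses the orientation, which is why the order of the parameters matters.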

The remaining faces of the cube map onto the surface of the
volume. They have separate quad tree representations where they
store special points, regions, and lines, such as limbs or curves
of parabolic points. Each face will form a coordinate patch, with
one parameter fixed, and the other two parametrizing the surface.
This is true for reasonable parametrizations.

So  there  are  two  kinds  of  edges  on  the  volume  surface, 

1.  where  patches  meet. 

2.  where jump-surfaces (derivative discontinuities) intersect
the parameter cube.

4.2.C.  Quad  Trees. 

The condition for splitting squares up is that the opposite ends
of each line segment stored lie in different areas. When testing
for crossing of different segments, the longer one is split up
further to the level of the shorter.

At the nodes, the segment end points are ordered around the
borders of the square. This means that n segments in a square can
be tested for intersection with each other in O(n) time.
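One way to read the O(n) claim is sketched below. It assumes, beyond what the text states, that every segment in a square is a chord whose two end points lie on the border of the unit square and that no two end points coincide. Two chords then cross exactly when their end points interleave around the border, and a single pass with a stack over the ordered end points detects any interleaving. The function names are hypothetical.

```python
def border_position(point):
    """Counter-clockwise distance of a border point of the unit square
    from the corner (0, 0), measured along the border."""
    x, y = point
    if y == 0.0:
        return x
    if x == 1.0:
        return 1.0 + y
    if y == 1.0:
        return 3.0 - x
    if x == 0.0:
        return 4.0 - y
    raise ValueError("point is not on the border of the unit square")


def ordered_end_points(segments):
    """End points of all segments, ordered around the border. A quad-tree
    node would keep this order up to date; here it is rebuilt by sorting."""
    ends = []
    for seg_id, (p, q) in enumerate(segments):
        ends.append((border_position(p), seg_id))
        ends.append((border_position(q), seg_id))
    ends.sort()
    return ends


def any_crossing(ordered_ends):
    """Single pass over the ordered end points: a segment that closes
    while another one opened after it is still open must cross it."""
    stack, open_ids = [], set()
    for _, seg_id in ordered_ends:
        if seg_id in open_ids:
            if stack[-1] != seg_id:
                return True
            stack.pop()
            open_ids.remove(seg_id)
        else:
            stack.append(seg_id)
            open_ids.add(seg_id)
    return False


crossing = [((0.0, 0.5), (1.0, 0.5)), ((0.5, 0.0), (0.5, 1.0))]
parallel = [((0.0, 0.2), (1.0, 0.2)), ((0.0, 0.8), (1.0, 0.8))]
print(any_crossing(ordered_end_points(crossing)))
print(any_crossing(ordered_end_points(parallel)))
```

The pass itself is linear in the number of end points; the sort is only needed here because the example does not maintain the ordering incrementally.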

Also, regions of the tree can be marked by stepping around
adjacent squares, moving away from the inside of the directed
border segments.

Part  Two. 

When trying to unravel the structure of the projection mapping
there are two questions.

1.  What are the singularities and structure of the mapping

P1 : surface -> image ?

This is the problem addressed by the first part of the paper.
The tooth-pick representation summarises the qualitative
structure of P1, while providing a framework for the quantitative
details. Another way of thinking of P1 is,

P1 : ray -> image, where one ray shoots out of the eye down the
line of sight and intersects the surface in a number of places.
The ray has two degrees of freedom (d.o.f.). Restricting it by
one dimension gets the generic limb singularities; by another,
gets the generic point singularities of cusp (local) and
T junction (not local).

P1 : ray -> image, 2 d.o.f.

P11 : limb-ray -> image, 1 d.o.f.

P111 : cusp-ray -> image, 0 d.o.f.

P112 : T-ray -> image, 0 d.o.f.

2.  How do the singularities change for

(i)  varying viewpoint, P2 : Viewpoint -> P1,

(ii)  varying surface shape, P3 : Surface-Shape -> P2 ?

The singularities of P2 are roughly derived in the next section;
here the three types of 2-dimensional singularities are
summarized. Each has a generating curve and directions which
sweep out the ruled surface of viewpoint singularities.


1.  Line of parabolic points, with asymptotic direction at each
point.

2.  Line of asymptotic inflexions, with (2) asymptotic
directions at each point.

3.  Pair of T generating lines, where corresponding points, one
from each line, have a common tangent plane. The special
direction is along the line joining corresponding points.

The simplest singularities of P3 occur when the generating
lines of P2 appear, merge or split.


5.1.  Singularities  for  Changing  Viewpoint. 

As viewpoint changes the quantitative details of P1 change, but
its structure is invariant over a range of viewpoints. P1 is
stable for a generic viewpoint with 3 degrees of freedom.
Restricting the viewpoint by one degree of freedom, there are
three ways the structure of P1 can be altered.

1. (a) Limbs in the same limb set (i.e. surrounding a forward or
backward facing region of surface) touch and merge, joining up
two regions that face the same way.

(b) A new limb loop is born at a point.

Both of these imply that the derivative of
(surface-normal · line-of-sight) is 0 in all directions at that
point. The only way that this can be satisfied is to have one
principal curvature zero, with its asymptotic direction
coincident with the line of sight. So the singular viewing
surface is ruled through the surrounding space, by sweeping a
line tangent to the asymptotic direction along the closed loop
parabolic curves on the surface. This gives the mapping,

P21 : viewpoint-on-swept-surface-1 -> P1.
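The step from the vanishing derivative to the zero principal curvature can be spelled out as follows. This is a sketch under assumptions of ours that the text does not state: orthographic projection with a fixed unit line of sight $v$, unit surface normal $n$, and the symmetric shape operator $S$ with the sign convention $dn_p(w) = -S w$.

```latex
\begin{align*}
g(p) &= n(p)\cdot v, \qquad \text{limb: } g(p) = 0
      \quad (\text{so } v \text{ is tangent to the surface at } p),\\
dg_p(w) &= dn_p(w)\cdot v = -\langle S w, v\rangle = -\langle w, S v\rangle
      \quad \text{for every tangent vector } w,\\
dg_p \equiv 0 &\iff S v = 0
      \iff v \text{ is a principal direction with principal curvature } 0 .
\end{align*}
```

Since $\langle S v, v\rangle = 0$ as well, $v$ is an asymptotic direction, and the point is parabolic whenever the other principal curvature is non-zero.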

The asymptotic direction is not necessarily tangent to the
parabolic curve. Generically, one end points into the hyperbolic
region, the other into the elliptic region. Looking along the
asymptotic ray from elliptic to hyperbolic side, (b) is seen; the
reverse direction gets (a).

2.  Two cusps and a T junction appear in a fishtail shape, at a
point on a limb in a hyperbolic region.

The condition for this is that the line of sight is tangent to
one asymptotic direction at a point where the asymptotic line has
an inflexion.

To derive this result (see fig. 4), imagine that the viewpoint is
moving (with 3 d.o.f.) so that the two cusps approach each other
and merge. Since their limbs are tangent to the asymptotic
direction, the cusps must approach by sliding in along that
direction, which gives the asymptotic line an inflexion point
where they meet.

So the singular viewing surface is ruled by sweeping a line in
the asymptotic direction along the curve of asymptotic
inflexions. This curve may intersect itself or intersect the
curve of asymptotic inflexions of the other asymptotic direction,
or be tangent to the curve of parabolic points, e.g. the curve
around the middle of the hole in a torus. This gives,

P22 : viewpoint-on-swept-surface-2 -> P1.

3.  A  pair  of  T  junctions  appear  at  a  point. 

Two contours become tangent in the image, and then form a pair
of Ts. In order to see this singularity, the two surface points,
whose contours are tangent, must have coincident surface tangent
planes, with the line of sight aligned through both points.

P23 : viewpoint-on-swept-surface-3 -> P1.

Pairs of T junction generating points form into two parallel
closed loops. Each point on the upper loop has a corresponding
point on the lower loop with a coincident tangent plane, and
direction pointing to its mate. The singular surface is ruled by
sweeping the direction around the loop.

There  are  several  interesting  1  d.o.f  viewpoint  singularities 
and  some  weird  0  d.o.f.  ones,  where  the  viewpoint  has  to  be 
at  one  special  point  in  space.  See  [1]  (pages  145  -  149)  for  a 
complete  classification. 

e.g. P211 : viewpoint-on-straight-ray -> P1, where the ray is
the asymptotic direction at a special parabolic point which has
parabolic curve tangent to asymptotic direction. This is the
borderline between 1(a) and (b).

5.2  Singularities  for  Changing  Surface  Shape. 
Leaving aside for now the problem of how the structure of P2 is
to be represented, what are the singularities of the mappings

P3 : Surface-Shape -> P2,

Q3 : Surface-Shape -> P1 ?

Some of the local singularities are:

1(a) A parabolic curve appears at a point, e.g. stretching and
bending a sphere.

(b) A parabolic line pinches off to form two parabolic lines,
e.g. turning a banana into a dumbbell. This can only happen
when the two points that approach each other, where the split
occurs, both have asymptotic direction tangent to the
parabolic line. The two points annihilate each other at the
split. (They are a parabolic line analog of cusps on limbs.)

2.  Lines of asymptotic inflexion appear and split. Their cusp
analogs are points of self intersection, of intersection with
other curves of asymptotic inflexion, and where they are
tangent to a parabolic line.


3.  Similarly  the  T  generators  appear,  merge  and  split. 


6  Prediction  Structures. 

In [5], Koenderink and van Doorn describe a graph structure
called the “Visual Potential” which predicts what an object looks
like from its different characteristic views. Each node
represents a volume in the viewing space, with a link between
adjacent volumes. These volumes are sliced out of space by the
ruled surfaces described above, and each link corresponds to a
particular event on the object surface. E.g. limbs meet at some
point on a segment of a parabolic curve.

A  graph  like  this  provides  a  general  framework  for  3D 
predictions. 
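A graph of this kind could be held in a structure like the one sketched here. This is a hypothetical illustration, not the representation used in [5]: the class names, the volume names "A", "B", "C" and the event labels are all placeholders chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ViewVolume:
    """Node: a volume of viewing space in which the view keeps one structure."""
    name: str
    links: dict = field(default_factory=dict)  # neighbour name -> event label


class VisualPotential:
    """Graph of view volumes; each link is labelled with the surface event
    that occurs when the viewpoint crosses the separating ruled surface."""

    def __init__(self):
        self.volumes = {}

    def volume(self, name):
        if name not in self.volumes:
            self.volumes[name] = ViewVolume(name)
        return self.volumes[name]

    def link(self, a, b, event):
        # Links are symmetric: the same event is met in either direction.
        self.volume(a).links[b] = event
        self.volume(b).links[a] = event

    def events_along(self, path):
        """Events met while the viewpoint moves through the listed volumes."""
        return [self.volumes[a].links[b] for a, b in zip(path, path[1:])]


graph = VisualPotential()
graph.link("A", "B", "limbs merge on a parabolic curve")
graph.link("B", "C", "two cusps and a T junction appear")
print(graph.events_along(["A", "B", "C"]))
```

Because links are stored per node, the neighbourhood of the current volume can be built on demand without generating the whole graph.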

To be useful for recognition or animation, it is not necessary
to generate the whole graph; it is enough that at any viewpoint
the structure of the graph locally can be generated. This is
analogous to not generating the whole image, but only being able
to work out what the image looks like near a single ray. At a
node, something like the “tooth-pick” structure could be stored,
along with the likely singularities (3 different types), so that
links can be generated in response to a change in viewpoint.
E.g., see fig. 9 for the sequence of views of a “dumbbell”,
moving from an oblique to an end-on view.

The singularities produced by changing the surface shape suggest
a model hierarchy. One object might be described in terms of
another by giving the sequence of singularities that occur when
its surface is deformed to fit the other's. E.g., forming a
“dumbbell” (see fig. 10) by deforming a sphere. First the sphere
is bent into a banana shape when a parabolic curve, bisected by a
curve of asymptotic inflexions, appears at a point on the
sphere's surface. Then the two ends of the asymptotic inflexion
curve, which are also on the parabolic curve, meet up around the
girth of the banana. The parabolic curve splits into two, and the
banana becomes a dumbbell.


1.  Abstract 

In this paper, we introduce the information-centered theory of
optimal algorithms as an approach to image understanding
problems, and apply it to obtain a dense depth map from a sparse
one (the “2-1/2 D Sketch”). There are three major results. First,
we give a spline interpolation algorithm that is provably optimal
in the worst case for surface reconstruction; since it is linear
in the data, it is simple to compute using precomputed
coefficients. Secondly, we show that adaptive information (that
is, the intelligent and selective determination of which depth
points to sample based on the values of previously sampled
points) does not improve accuracy performance; a simple regular
grid is provably optimal. Third, we discuss designs for
implementations exploiting the above results which are very
amenable to parallel processing, and allow for local, point-wise
determination of surface character without the necessity for
global optimization. We conclude with some remarks on our
construction and execution of a preliminary form of one such
implementation.


2.  Introduction 

The calculation of a full depth map of a scene from information
present in an image is a central problem in image understanding.
In general, what is desired in the full depth map is some “best”
surface that fits the sparse and errorful depth data (the
2-1/2 D sketch) derived from shading, binocularity, motion,
texture, and other “shape-from-x” surface cues. Mathematically,
this can be cast as an interpolation problem subject to some
error criterion.

Much work has already been done [3, 4, 5]. This paper
investigates the general problem from a different and relatively
new viewpoint. We attempt to answer the following class of
questions:

1.  What algorithms are provably optimal with respect to the
accuracy of the constructed full depth map?

2.  What information on which to base the construction is
provably optimal with respect to accuracy? (Here, which depth
samples, taken in whatever place, in whatever order, however
intelligently?)

3.  What properties does the optimal algorithm have? Do the
properties lead to feasible and, even, parallelizable
computation?

We address the first question in Section 3, and show that spline
algorithms are optimal with respect to the worst case error
criterion. In Section 4, we first show that adaptive
information, which is seemingly much more powerful than
nonadaptive information (and certainly more computationally
complex), does not improve accuracy performance. Thus one only
has to seek for optimal information among the classes of
nonadaptive information. In Section 5, we construct the spline
algorithm. Spline algorithms are linear in their data, and hence
favorable for parallel computations.


A note to the reader. In brief, our intention is to show how the
problem of transiting from a sparse depth map to a full one can
be cast in the framework of the theory of computational
complexity and optimal algorithms. Once the problem is posed in
the context and terminology of that field, the solution is a
straightforward special case of several existing theorems.
However, since the methods and terms of that subject are probably
foreign to most vision researchers, we also take care in what
follows to explicate, step-for-step, the reasoning behind the
procedures, explaining the more abstract constructs in terms of
the actual vision problem at hand. We therefore adapt the General
Theory of Optimal Algorithms [6] to binocularity in what follows
below, retaining much of the specialized notation, but with
running glosses. In part, our intention also is to alert the
vision community to the relevance of the research in this area.
Additionally, we hope to exploit the theory further in later
papers for other vision problems, such as optimal surface
recovery from a monocular intensity array (optimal shape from
shading).


3.  What  is  the  Optimal  Algorithm? 

The analysis proceeds in several steps. To begin, the problem
must be restated as a problem of: classes of functions (here, of
three-dimensional surfaces), available information, and classes
of algorithms. The following aspects must be quantified, and we
discuss them in turn:

1.  The space of surfaces.

2.  The information available and the dependencies by which it
is obtained.

3.  The class of algorithms.

4.  The measure of error and the meaning of “optimal”.

5.  The specification of splines and spline algorithms.

6.  The optimality of spline algorithms.


3.1.  Choosing  the  Space  of  Surfaces  and  Their  Norms 

We  take  as  our  space  of  (possibly  infinite)  real-world 
surfaces  the  following  class: 

Let $F_1$ be the set of all real-valued functions $f$ defined on
$\mathbb{R}^2$, such that $f$ and its first and second order
partial derivatives all belong to $L_2(\mathbb{R}^2)$. That is,
the class of real-world surfaces is smooth at least up to local
curvature: their curvatures are square-integrable. In particular,
this class rules out any surfaces that are merely piece-wise
continuous or differentiable. These two latter exceptions,
unfortunately, rule out true occlusions (where depth is
discontinuous) and true corners (where the derivative is). Thus,
the world to be seen appears as if it were shrink-wrapped:
corners are rounded and discontinuities papered over. Inasmuch as
the surfaces of objects tend to be locally smooth, however, this
appears to be a reasonable assumption.

We attempt to “see” a subimage of these real-world surfaces. They
are supported in a finite region $D$ of the xy plane; visually
speaking, it is the xy plane that forms the background, and $D$
that forms a finite subimage. For simplicity assume that $D$ is a
square. Then the class of surfaces $F_2$ that we want to recover
is the constriction of the surfaces in $F_1$ to $D$:

$F_2 = \{\, f|_D : f \in F_1 \,\}$    (1)


3.2.  Quantifying the Idea of Information
To recover the surface (a member of $F_2$) we start with samples
of depth data (in region $D$) of some type; these samples we
would normally call information. More precisely, information is
defined in the general theory as a function of the following
form:

$N : F_1 \to \mathbb{R}^k$

That is, each type of information function is a mapping from the
class of given surfaces to k-vectors of image primitive values.
Each $f$ (each real-world surface) in $F_1$ is a member of the
domain of $N$; $N(f)$ is the vector of samples. In terms of the
vision problems, a given $N(f)$ tells in what way a smooth
surface, $f$, has been captured into a k-vector of extracted
image primitives. (Thus, the theory uses information to make more
precise the concept of “intrinsic image”; information can be
velocities, surface orientations, etc.) In the simplest case, and
the case in point here, $N$ would be the rule for extracting
depth values themselves:

$N(f) = [\, f(x_1,y_1),\ f(x_2,y_2),\ \ldots,\ f(x_k,y_k) \,]$,    (2)

$(x_i,y_i) \in D, \quad i = 1,\ldots,k.$

For ease of analysis we require that k > 3, and further, that
the k depth values not be coplanar. This rules out annoying
special cases which, although they are easily handled, would
otherwise have to be explicitly addressed in much of what
follows.

We can generalize the sampling of these data values a bit by
writing:

$L_i(f) = f(x_i,y_i), \quad i = 1,\ldots,k.$    (3)

One can easily check that each $L_i(\cdot)$ is a linear
functional on $F_1$. Our information is now in the form about
which most is known in the general theory:

$N(f) = [\, L_1(f),\ L_2(f),\ \ldots,\ L_k(f) \,]$.    (4)

In requiring the components of $N(f)$ to be linear functionals,
the vector $N(f)$ is implicitly restricted to be a collection of
samples taken “in parallel” from $f$ at points that can be
predetermined independently. That is, the i-th component of
$N(f)$ depends only on $f$, rather than on some dynamically
changing sampling method based on the previous (i-1) components
of $N(f)$. In short, this information is nonadaptive. It is in
contrast to the information used in many optimum-seeking
algorithms (such as root-finding), which selectively sample
promising areas increasingly more densely, based on their
nearness to an optimum.
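The distinction between the two kinds of information can be made concrete with a small sketch. It is illustrative only: the function names and the sampling rule `climb` are hypothetical and are not taken from the general theory. The nonadaptive operator fixes its sample points in advance and is linear in the surface; the adaptive one lets each new point depend on the values already seen.

```python
def nonadaptive_information(f, points):
    """N(f) = [f(x_1, y_1), ..., f(x_k, y_k)], sample points fixed in advance."""
    return [f(x, y) for (x, y) in points]


def adaptive_information(f, first_point, choose_next, k):
    """Each new sample point may depend on all previously observed values."""
    points, values = [first_point], [f(*first_point)]
    while len(values) < k:
        nxt = choose_next(points, values)
        points.append(nxt)
        values.append(f(*nxt))
    return points, values


def climb(points, values):
    """Hypothetical rule: keep stepping in x while the samples do not
    decrease, otherwise step in y."""
    x, y = points[-1]
    if len(values) < 2 or values[-1] >= values[-2]:
        return (x + 0.25, y)
    return (x, y + 0.25)


grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
f = lambda x, y: x + 2.0 * y
g = lambda x, y: x * y

# Linearity of nonadaptive information: N(2f + 3g) = 2 N(f) + 3 N(g).
mixed = nonadaptive_information(lambda x, y: 2.0 * f(x, y) + 3.0 * g(x, y), grid)
combo = [2.0 * a + 3.0 * b for a, b in zip(nonadaptive_information(f, grid),
                                           nonadaptive_information(g, grid))]
print(mixed == combo)

# Adaptive sampling visits different points for different surfaces.
print(adaptive_information(f, (0.0, 0.0), climb, 4)[0])
print(adaptive_information(lambda x, y: y - x, (0.0, 0.0), climb, 4)[0])
```

With the fixed grid, all k samples can be taken in parallel; with the adaptive rule, the i-th sample cannot be taken until the previous ones are known.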

Superficially, an implicit restriction of $N(f)$ to
nonadaptivity seems to be a restriction to a less powerful set of
information-gathering strategies. It will turn out, however, that
it has absolutely no effect. Even if we extend the definition of
information to allow the use of any adaptive sampling of linear
functionals, no matter how intelligent, the intrinsic error is no
less than that obtained by using nonadaptive information only.
(We will address this issue in Section 4.) Thus, what is
important in terms of the general theory is simply that the
information be linear, such as depth values of surfaces are.


3.3.  Defining  the  Class  of  Algorithms 
From the information $N(f)$, which is a k-vector of samples
(here, of depth values), we now choose an algorithm $\varphi$ to
recover $f$ in $F_2$ (here, that part of the real-world surface
we attempt to “see”). The algorithm $\varphi$ is defined in a
very general way as a member of a class of mappings as follows:

$\varphi : \mathbb{R}^k \to F_2, \quad \varphi \in \Phi$    (5)

Thus, in the general theory an algorithm maps vectors of samples
(of a function $f$ in $F_1$) into a solution function (in
$F_2$). Note that in this general definition, $F_1$ and $F_2$ are
usually different classes. For example, in classical quadrature
algorithms for numerical integration, $F_1$ is the class of
functions to be integrated, $N(f)$ are sample points, $\varphi$
is the quadrature formula, and $F_2$ is simply $\mathbb{R}$, the
set of reals. Here, $F_1$ are real-world surfaces and $F_2$ are
their (restricted) reconstructions over $D$.
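The quadrature example can be written out as the composition of an information operator and an algorithm. The sketch below uses the composite trapezoid rule purely as a familiar instance; the names `information` and `trapezoid` are ours.

```python
def information(f, nodes):
    """N(f): the vector of samples of f at the given nodes."""
    return [f(x) for x in nodes]


def trapezoid(samples, a=0.0, b=1.0):
    """phi: maps a k-vector of equally spaced samples on [a, b] to a real
    number, the composite trapezoid estimate of the integral."""
    k = len(samples)
    h = (b - a) / (k - 1)
    return h * (0.5 * samples[0] + sum(samples[1:-1]) + 0.5 * samples[-1])


nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
f = lambda x: 2.0 * x + 1.0
# phi(N(f)) approximates the integral of f over [0, 1]; the rule is exact
# for linear integrands.
print(trapezoid(information(f, nodes)))
```

The algorithm never sees $f$ itself, only $N(f)$, which is the point of separating the two mappings.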


3.4.  Defining  Algorithm  Error,  Optimal  Algorithm 
In order to compare solutions and to measure the approximation
error of $\varphi$, $F_2$ can be equipped with a norm
$\|\cdot\|_2$. By applying this norm,
$\| f|_D - \varphi(N(f)) \|_2$ now quantifies the difference
between $f|_D$, which is that portion of the real-world surface
to be recovered over $D$, and $\varphi(N(f))$, which is the
surface we construct by using information $N$ and algorithm
$\varphi$.

There are many choices for $\|\cdot\|_2$; even a weak one will
do. For our problem of depth interpolation, we can use the
$L_2$-norm, defined as

$\| f \|_2 = \left\{ \iint_D f^2 \, dx \, dy \right\}^{1/2}$    (6)


Which norm is the most natural or accurate measure of algorithm
error? It turns out that the determination of the optimality of
the algorithm is independent of the choice.

We are nearly ready to define the error of a given algorithm.
Based on this definition, we will seek an algorithm that
reconstructs the surface with minimum error. We would like to
define the error of a particular algorithm in the worst case to
be something like the following:

$e(N, \varphi, f) = \sup \| f'|_D - \varphi(N(f)) \|_2$    (7)

where the supremum is taken over all $f'$ in $F_1$ that satisfy
the same information. That is, the supremum should be taken over
the set $V(N,f)$, where

$V(N,f) = \{\, f' \in F_1 \mid N(f') = N(f) \,\}$.    (8)

This definition measures the distance between the actual
computed solution, $\varphi(N(f))$, and all functions $f'$ in
$F_1$ that could possibly have been the source of the observed
information. Since the $f'$ in $V(N,f)$ are completely
indistinguishable (we know nothing besides the information
$N(f)$), we cannot tell which of them could have been the
original function. Thus, we take the supremum to handle the worst
case.

The problem with such a definition is that the class of
functions in $F_1$ is usually too large. Given $N(f)$, there are
infinitely many interpolating $f'$ in $V(N,f)$. Unless the 2-norm
is, in a sense, very weak, for many of these functions, the
supremum and the worst case error will almost certainly be very
large. However, in terms of the physics of the image
understanding problem, many of these surfaces would also be
physically impossible as well: some would have to pass through
the camera itself; others would be impossible to fabricate under
any known natural or artificial manufacturing process.

What we usually prefer instead is a solution function that comes
as close as possible to “reasonable” members of $V(N,f)$, rather
than to all of them. Intuitively, a function is considered
“reasonable” if it satisfies some desirable side conditions.
Mathematically such a function is often characterized by
expressing the desired properties in terms of another norm (or
semi-norm), and defining “reasonable” to mean that these values
are within certain bounds. We will denote the reasonableness norm
by the 4-norm; just as there are many choices for the 2-norm, the
actual definition of the 4-norm is determined by the problem.

One example of a desirable property for functions in $V(N,f)$ is
smoothness; this can be quantified by applying to elements of
$F_1$ the quadratic variation semi-norm. This semi-norm is
defined to be:

$\| f \|_4 = \left\{ \iint_{\mathbb{R}^2} \left[ (f_{xx})^2 + 2 (f_{xy})^2 + (f_{yy})^2 \right] dx \, dy \right\}^{1/2}$    (9)

Given that second derivatives are closely related to surface
curvature, this semi-norm has an appealing intuitive physical
analogy: it measures the bending energy in an ideally thin and
elastic plate which has been forced into the shape of $f$. We
will return to this example when we discuss our efforts on
implementation.
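A discrete version of the quadratic variation gives a feel for what the semi-norm measures. The sketch below is an approximation under our own assumptions: the surface is sampled on a square grid of spacing `h`, second derivatives are replaced by central differences, and only interior grid points are summed, so the integral is taken over a finite patch rather than over all of $\mathbb{R}^2$. The helper names are hypothetical.

```python
import math


def sample(f, n, h):
    """z[i][j] = f(i*h, j*h) on an n-by-n grid (i along x, j along y)."""
    return [[f(i * h, j * h) for j in range(n)] for i in range(n)]


def quadratic_variation(z, h):
    """Finite-difference estimate of the quadratic variation semi-norm,
    summed over the interior points of the grid."""
    total = 0.0
    for i in range(1, len(z) - 1):
        for j in range(1, len(z[0]) - 1):
            fxx = (z[i + 1][j] - 2.0 * z[i][j] + z[i - 1][j]) / h ** 2
            fyy = (z[i][j + 1] - 2.0 * z[i][j] + z[i][j - 1]) / h ** 2
            fxy = (z[i + 1][j + 1] - z[i + 1][j - 1]
                   - z[i - 1][j + 1] + z[i - 1][j - 1]) / (4.0 * h ** 2)
            total += (fxx ** 2 + 2.0 * fxy ** 2 + fyy ** 2) * h ** 2
    return math.sqrt(total)


h = 0.1
plane = sample(lambda x, y: 1.0 + 2.0 * x + 3.0 * y, 11, h)
bowl = sample(lambda x, y: x * x, 11, h)
print(quadratic_variation(plane, h))  # numerically zero: planes do not bend
print(quadratic_variation(bowl, h))   # positive: the surface is curved
```

Planes have zero quadratic variation, which is why this is a semi-norm rather than a norm.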

We can use this semi-norm (or any other “reasonable” one) to
better define algorithm error. There are many ways to do so; one
way would be, for each interpolating $f'$, to divide the
approximation error given by the 2-norm in Equation (7) by some
function of the reasonableness of $f'$. This gives a type of
relative error for each $f'$; the more badly behaved a surface,
the higher its curvature and the more vigorously its actual
algorithmic error would be restrained. Note that even
particularly wild functions are still required to interpolate the
information, so that it is likely that their relative error is
rather small; the calculation would be dominated by the extreme
normalizing value.

We redefine algorithm error, then, in the general case as
follows:

$e(N, \varphi, f) = \sup \| f'|_D - \varphi(N(f)) \|_2 \; \rho(\| f' \|_4)$    (10)

where the supremum is again taken over all $f'$ satisfying the
information $N$ (that is, $f'$ is in $V(N,f)$). The
reasonableness function $\rho$ simply maps the positive value of
the 4-norm into a new positive value: if $\rho(x) = 1/x$, this is
simple relative error; if $\rho(x) = 1$, we have recovered
absolute error again; many others are possible. Importantly, one
of the results of the general theory is that optimality is
independent of the exact nature of $\rho$.

Now we can define what an optimal (in the worst case) algorithm
is. The algorithm $\varphi^*$ in $\Phi$, the class of algorithms,
is strongly optimal if and only if for each $f$ in $F_1$,

$e(N, \varphi^*, f) = \inf e(N, \varphi, f)$    (11)

where the infimum is taken over all algorithms $\varphi$ in
$\Phi$. That is, an algorithm $\varphi^*$ is optimal if and only
if the algorithm error from using $\varphi^*$ (for any $f$ in
$F_1$) is no more than the algorithm error from using any other
$\varphi$ in $\Phi$.

Note that this does not imply that the optimal algorithm must
necessarily have finite algorithm error. However, if the optimal
algorithm does have infinite algorithm error (that is, if at
least one $f$ has infinite approximation error), then every other
algorithm has infinite algorithm error, too.


3.5.  Spline  Functions 
We next define a particular interpolating function which is
based on our reasonableness norm, $\|\cdot\|_4$. Although this
function is derived from what appears to be only a
problem-dependent side condition, it will be the primary function
that leads to the optimal solution of the interpolation as a
whole.

Recall that $V(N,f)$ is the class of all functions in $F_1$ that
share the same information with $f$ under the information
extraction function $N$. In purely visual terms, it is the set of
all surfaces defined on the infinite background that interpolate
the depth data. Let $\sigma_{N(f)}$ be that member of $V(N,f)$
with minimum 4-norm. That is:

$\| \sigma_{N(f)} \|_4 = \inf \| f' \|_4$    (12)

where the infimum is taken over all $f'$ in $V(N,f)$. We call
$\sigma_{N(f)}$ the spline function interpolating the data
$N(f)$. It is not hard to show that such a spline function exists
and is unique, if there are at least four non-coplanar data
points, which is true for almost all cases. For more details, see
[7], page 71.

Thus, this unique spline function $\sigma_{N(f)}$ interpolates
the information, and, because it minimizes the 4-norm, of all
such interpolants it is the most “reasonable”. This need not
imply that it also minimizes worst case algorithm error;
algorithm error is based on a different norm, in a more
constricted space. However, it is surprising that under very
general conditions, $\sigma_{N(f)}$ directly provides the optimal
algorithm.


3.6.  Spline  Algorithms  and  their  Optimality 

Our  last  step  is  to  define  a  class  of  algorithms  based  on 
the  spline  functions;  this  class  will  contain  the  optimally 
interpolating  algorithm  we  seek. 

The algorithm $\varphi^s$ that takes information $N(f)$, chooses
the interpolating spline function $\sigma_{N(f)}$ and then
constricts it to $D$, we call a spline algorithm. More precisely,
the spline algorithm is defined as:

$\varphi^s(N(f)) = \sigma_{N(f)}|_D$,    (13)

where $\sigma_{N(f)}$ is the spline function. In our example with
quadratic variation as semi-norm, the spline is unique, so the
spline algorithm is well defined.
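For the quadratic variation semi-norm, the minimizing interpolant over the whole plane is, to our understanding, the classical thin-plate spline: a weighted sum of radial terms $r^2 \log r$ centred on the data points plus an affine part. The sketch below fits one with a small dense solver and checks that the result is linear in the depth data. It is not necessarily the construction the paper gives in its Section 5; the function names and the sample data are hypothetical, and the data points are assumed to be distinct and not all collinear in the xy plane.

```python
import math


def kernel(r2):
    """Thin-plate radial term U(r) = r^2 log r, given r^2; U(0) = 0."""
    return 0.0 if r2 == 0.0 else 0.5 * r2 * math.log(r2)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_spline(points, depths):
    """Return the interpolating thin-plate spline as a function of (x, y)."""
    k = len(points)
    a = [[0.0] * (k + 3) for _ in range(k + 3)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            a[i][j] = kernel((xi - xj) ** 2 + (yi - yj) ** 2)
        for c, value in enumerate((1.0, xi, yi)):
            a[i][k + c] = value  # affine part
            a[k + c][i] = value  # side conditions on the weights
    coeffs = solve(a, list(depths) + [0.0, 0.0, 0.0])
    weights, (c0, c1, c2) = coeffs[:k], coeffs[k:]

    def spline(x, y):
        bend = sum(w * kernel((x - px) ** 2 + (y - py) ** 2)
                   for w, (px, py) in zip(weights, points))
        return bend + c0 + c1 * x + c2 * y

    return spline


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
z1 = [0.0, 1.0, 1.0, 0.5, 2.0]
z2 = [1.0, 0.0, 2.0, 1.0, 0.0]
s1, s2 = fit_spline(points, z1), fit_spline(points, z2)
s12 = fit_spline(points, [a + 2.0 * b for a, b in zip(z1, z2)])
# Linearity in the data: the spline of z1 + 2 z2 equals s1 + 2 s2.
print(s12(0.3, 0.7), s1(0.3, 0.7) + 2.0 * s2(0.3, 0.7))
```

Because the system matrix depends only on the sample positions, it can be factored once for a fixed sampling grid and reused for every new vector of depth values, which is the sense in which the coefficients can be precomputed.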

The import of $\varphi^s$ is given by the following theorem.

Theorem. The spline algorithm is strongly optimal (in the worst
case). That is,

$e(N, \varphi^s, f) = \inf e(N, \varphi, f)$, for all $f \in F_1$,    (14)

where the infimum is taken over all $\varphi$ in $\Phi$.
Further, the optimality of the spline algorithm is independent of
the choice of norm $\|\cdot\|_2$.

The proof is based on the following observations. First, it
can be shown that the definition of the algorithm error as
defined in Equation (10) is equivalent to the following
definition, which assigns the 4-norm a more tractable role:

$e(N, \phi, f) = \sup \| \tilde{f}|_D - \phi(N(f)) \|_2$,  (15)

where the supremum is now taken over the smaller class
of information-satisfying functions that have only a strictly
limited reasonableness: $\tilde{f}$ must be in $V_4(N,f)$, where

$V_4(N,f) = \{ \tilde{f} \in V(N,f) : \|\tilde{f}\|_4 < c \}$.  (16)

The proof is given in [7], Appendix E; the finite value c is
arbitrary. Essentially this observation allows one to define
directly a class of "reasonable" functions, and to measure
approximation error on an absolute basis using the 2-norm
alone.

Secondly, the spline function $\sigma_{N(f)}|_D$ can be shown to be
the exact center of the set $V_4(N,f)|_D$, and thus it must
minimize the worst case approximation error. The centrality
is proven by showing that every $\tilde{f}|_D$ in $V_4(N,f)|_D$ can be
expressed as the sum $\sigma_{N(f)}|_D + h$, where the properties of h
are sufficient to show that the difference $\sigma_{N(f)}|_D - h$ is also
in $V_4(N,f)|_D$.

For the detailed proof of the entire theorem, see [6],
Theorem 5.1, page 76.


4.  Adaptive  Information  Does  Not  Help 

In Subsection 3.2, we defined information as samples of
depth data:

$N(f) = [L_1(f), L_2(f), \ldots, L_k(f)]$,  (17)

where $L_i(f) = f(x_i, y_i)$ and $(x_i, y_i) \in D$, $i = 1, \ldots, k$.
We call this nonadaptive information, since the i-th component
of N(f), $L_i(f)$, depends only on f. Adaptive information, on the
other hand, attempts to exploit whatever was learned while
obtaining the first (i-1) components of N(f). More precisely,
adaptive information $N^a$ is

$N^a(f) = z = [z_1, \ldots, z_k]$,  (18)

where $z_i = L_i(f; z_1, \ldots, z_{i-1})$, $i = 2, \ldots, k$. In the case of
depth values,

$z_i = f(x_i, y_i)$,  (19)

where $x_i = x_i(z_1, \ldots, z_{i-1})$ and $y_i = y_i(z_1, \ldots, z_{i-1})$,
with $(x_i, y_i) \in D$.

The structure of adaptive information is much richer than
that of nonadaptive information, and one might hope that, by
virtue of adaption, some intelligence might determine the
location for $(x_i, y_i)$ on the basis of the results of the (i-1)
prior samplings. Nevertheless, theory shows that, against
intuition, adaptive information cannot aid surface
approximation. For detailed discussions and proof, see [7],
pages 57-62. The formal proof is based on the radius of
information. Intuitively, the radius estimates the intrinsic
error of the problem. For several classes of problems
(including interpolation), the radius cannot be reduced by
adaptive strategies, in large part because there exist fixed
but universal strategies. This is perhaps the strongest result
of all: one cannot do better in collecting data than a
snapshot does. Not only can the data be collected in
parallel, it should be.

At this point, we have shown that the spline algorithm is
the optimal interpolation algorithm, and that nonadaptive
information suffices. However, we have not attempted to
optimize where to obtain the information itself. It is
apparent that not all sampling strategies are equal. If we
are free to select the location of information points, what
points are optimal?

Restating this problem, suppose we are allowed to choose a
k-vector of information samples: $N(f) = [f(x_1,y_1), \ldots, f(x_k,y_k)]$.
Suppose, too, that no matter what information we
select, we always use the optimal spline algorithm. Since
the algorithm is tailored to the information, any error that
would remain is intrinsically irreducible. We then can
define optimal information, denoted by $N^*$, to be that
information with the minimum intrinsic error.

For our depth map problem, the optimal choices for $(x_i, y_i)$
can be shown to lie on a regular grid. More precisely, let
the subimage D be the open rectangle $(0, (n+1)h) \times (0, (n+1)h)$.
Then the following information $N^*$ is optimal (up
to a constant factor) for surface recovery:

$N^*(f) = [f(h,h), f(h,2h), \ldots, f(h,nh),$
$\qquad f(2h,h), f(2h,2h), \ldots, f(2h,nh),$
$\qquad \ldots,$
$\qquad f(nh,h), f(nh,2h), \ldots, f(nh,nh)]$.  (20)

These are simply interior mesh points. Notice that the
optimality of this information (and, of course, the resulting
error of the optimal algorithm using this optimal
information) depends on the norm $\|\cdot\|_2$; in general, the
intrinsic error is monotonically decreasing in h. For a full
proof of the optimality of this particular $N^*$ mesh, see [1].
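
The mesh of Equation (20) can be enumerated directly. The following is a minimal sketch, not part of the original work; the function names `optimal_sample_points` and `sample` are hypothetical, and the ordering simply follows the order in which the samples are written in Equation (20).

```python
def optimal_sample_points(n, h):
    """Interior mesh points (i*h, j*h), i, j = 1..n, of the open square
    (0, (n+1)h) x (0, (n+1)h), listed in the order used in Eq. (20)."""
    return [(i * h, j * h) for i in range(1, n + 1) for j in range(1, n + 1)]


def sample(f, points):
    """Nonadaptive information: the depth function f evaluated at fixed points."""
    return [f(x, y) for (x, y) in points]


if __name__ == "__main__":
    pts = optimal_sample_points(3, 0.5)
    # Hypothetical depth function, used only to show the call pattern.
    print(sample(lambda x, y: x + 2.0 * y, pts))
```

Because the point locations depend only on n and h, all k = n*n samples can be taken at once, which is the nonadaptive "snapshot" described in the text.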


5.  Implementation  of  the  Optimal  Algorithm 

In the previous sections we have shown the existence and
uniqueness of the spline interpolating given depth data, and
we have shown its optimality for surface recovery problems.
In  this  section,  we  show  how  the  spline  functions  can  be 
constructed,  with  the  side  condition  of  minimizing  quadratic 
variation.  Note  that  in  what  follows,  we  do  not  necessarily 
require  optimal  information;  the  depth  samples  can  appear 
anywhere  within  the  subimage  D:  in  a  mesh,  aligned  on 
contours,  clustered,  or  even  at  random. 

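With some restrictions on $F_1$, it can be shown that the
appropriate reproducing kernel is
$K(x,y;u,v) = \{(x-u)^2 + (y-v)^2\}^{3/2}$, the Euclidean distance
cubed (see [2]).

Therefore the spline interpolating depth data
$z = [z_1, \ldots, z_k] = [f(x_1,y_1), \ldots, f(x_k,y_k)]$ is developed in
the usual way:

$\sigma_z(x,y) = \sum_{i=1}^{k} \alpha_i \{(x-x_i)^2 + (y-y_i)^2\}^{3/2} + \beta_1 x + \beta_2 y + \beta_3$,  (21)

where $\{\alpha_i\}$ and $\{\beta_j\}$ can be determined from the linear
system of equations:

$\sum_{i=1}^{k} \alpha_i \{(x_j-x_i)^2 + (y_j-y_i)^2\}^{3/2} + \beta_1 x_j + \beta_2 y_j + \beta_3 = z_j$, $j = 1, \ldots, k$,

$\sum_{i=1}^{k} \alpha_i x_i = 0$, $\quad \sum_{i=1}^{k} \alpha_i y_i = 0$, $\quad \sum_{i=1}^{k} \alpha_i = 0$.  (22)
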

From equations (21) and (22), it should be apparent that
the splines are linear in the data. That is, if $\sigma_1$
interpolates information $z^{(1)}$, and $\sigma_2$ interpolates information
$z^{(2)}$, then $c_1\sigma_1 + c_2\sigma_2$ interpolates information
$c_1 z^{(1)} + c_2 z^{(2)}$. In terms of image understanding, this means
that if two surfaces are superimposed so that their depth samples
accumulate, then the superimposition of the two full depth
maps derived independently creates a valid full depth map
for the ensemble.
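
The construction in equations (21) and (22) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: it assumes the three side conditions are that the coefficients $\alpha_i$ sum to zero and are orthogonal to the x and y coordinates of the sample points, it assumes the sample points are not all collinear, and the names `kernel`, `solve_linear`, `fit_spline`, and `evaluate` are hypothetical. A plain Gaussian elimination stands in for whatever solver one would actually use.

```python
def kernel(x, y, u, v):
    """Reproducing kernel of the text: Euclidean distance cubed."""
    return ((x - u) ** 2 + (y - v) ** 2) ** 1.5


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("singular system (are the sample points collinear?)")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_spline(points, z):
    """Build and solve the (k+3)-by-(k+3) system of Eq. (22).
    Returns (alpha, beta) with beta = [beta_1, beta_2, beta_3]."""
    k = len(points)
    A = [[0.0] * (k + 3) for _ in range(k + 3)]
    rhs = list(z) + [0.0, 0.0, 0.0]
    for j, (xj, yj) in enumerate(points):
        for i, (xi, yi) in enumerate(points):
            A[j][i] = kernel(xj, yj, xi, yi)
        A[j][k:k + 3] = [xj, yj, 1.0]          # linear part of Eq. (21)
        A[k][j], A[k + 1][j], A[k + 2][j] = xj, yj, 1.0  # assumed side conditions
    coeffs = solve_linear(A, rhs)
    return coeffs[:k], coeffs[k:]


def evaluate(points, alpha, beta, x, y):
    """Evaluate the spline of Eq. (21) at (x, y)."""
    radial = sum(a * kernel(x, y, xi, yi) for a, (xi, yi) in zip(alpha, points))
    return radial + beta[0] * x + beta[1] * y + beta[2]
```

Fitting two data vectors separately and then fitting their linear combination should give coefficients that combine in the same way, which is the linearity property stated above.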

Since the spline algorithm is linear in depth data, it can
be easily rewritten as the weighted sum of basis splines, as
follows. Suppose that the information is merely the i-th unit
vector for $R^k$, that is, $N(f) = e_i = [0, \ldots, 0, 1, 0, \ldots, 0]$,
where the unit is in the i-th coordinate position. This
simpler information constraint is satisfied by a unique basis
spline function $\sigma_i$, with the property that $\sigma_i(x_j, y_j) = \delta_{ij}$
(the Kronecker delta). In terms of the depth interpolation
problem, $\sigma_i$ generates a surface that has a value of 1 at
sample point $(x_i, y_i)$, is identically zero at all other sample
points, and is smoothly rippled in all the space between, in
order to minimize its bending energy. The spline
interpolating depth data $z = [z_1, \ldots, z_k]$ then is simply
the weighted sum of these individual basis splines, with z-
values as the weights:

$\sigma_z = \sum_{i=1}^{k} z_i \sigma_i$.  (23)

Therefore the problem of computing $\sigma_z$ may be reduced to
that of solving the k independent subproblems of computing
each $\sigma_i$. This has several important consequences for
implementation.

1. Since the heart of the method is the solution of a
system of linear equations, much is known about
its convergence, stability, and running time (about
$O(k^3)$).

2. If the location of the information is known
beforehand, the basis splines can be precomputed.

3. With precomputed coefficients, the desired
interpolation points can be computed in parallel
with a simple SIMD algorithm.

4. With precomputed coefficients, any individual
value of the solution surface can be determined
locally, without the need for global optimization
over the full surface: if just one is needed, just
one is computed.

5. With precomputed coefficients, the surface can be
incrementally updated. Any sample value that
changes over time has only a linear effect on
existing interpolated data. The increments at
each point can be computed in parallel with a
trivial SIMD algorithm.

6.  If  one  uses  optimal  information,  the  system  of 
coefficients  becomes  highly  regular;  the  matrix  has 
only  a  very  small  number  of  distinct  entries  (for 
large  k,  approximately  k/2  for  a  k-by-k  system). 
Symmetries  suggest  that  efficient  solution  is 
possible,  even  without  precomputation. 

7.  If  one  uses  optimal  information,  the  system  shows 
an  eight-fold  symmetry,  which  could  probably  be 
exploited  both  in  precomputation  and  execution. 

6.  Preliminary  Experimentation 

These results suggest that it would not be hard to
construct a special-purpose machine for surface interpolation
that would be very quick and accurate. Using active
imaging, it could obtain depth samples on a square grid of
k total points, by ranging or by triangulation. The position
of these sample points would remain fixed, so all coefficients
could be precomputed. Run-time computation would entail
only the distribution of input data and the calculation and
collection of output data, using weighted sums of
precomputed coefficients. Thus, the interpolated value at
any point (x,y) is given by:

$\sigma_z(x,y) = \sum_{i=1}^{k} z_i \sigma_i(x,y)$,  (24)

where $z = [z_1, \ldots, z_k] = [f(x_1,y_1), \ldots, f(x_k,y_k)]$ and
each $\sigma_i$ has been previously computed. Such a SIMD
algorithm would require about k multiplications, plus k units
of local storage per process. If special purpose hardware
were available, the data could be circulated in a type of
toroidal systolic array. All output would be complete in
roughly k cycles. Precomputation could be achieved in k
parallel streams as well.
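
The run-time step can be sketched as a table lookup followed by a weighted sum. This is an illustrative sketch under stated assumptions, not the machine described above: `basis_values[p][i]` is assumed to hold the precomputed value of the i-th basis spline at output point p, the numbers used below are placeholders rather than real spline values, and the function names are hypothetical.

```python
def interpolate(basis_values, z):
    """For every output point p, form the weighted sum over i of
    z[i] * basis_values[p][i]. Each output is independent of the
    others, so the loop over p could run in parallel."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in basis_values]


def incremental_update(values, basis_values, i, delta):
    """Update existing outputs after sample i changes by delta.
    By linearity, only delta * basis_values[p][i] is added at each p."""
    return [v + delta * row[i] for v, row in zip(values, basis_values)]


if __name__ == "__main__":
    # Placeholder table: 3 output points, 2 samples (not real spline values).
    table = [[1.0, 0.0], [0.4, 0.6], [0.0, 1.0]]
    surface = interpolate(table, [2.0, 4.0])
    print(surface)
    print(incremental_update(surface, table, 1, 0.5))
```

Each output costs k multiplications and needs only its own row of k precomputed weights, which matches the per-process cost estimate given in the text.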

We  have  simulated  some  of  this  behavior  on  a  standard 
uniprocessor.  We  briefly  list  the  following  experimental 
preliminary  results,  obtained  by  Terry  Boult.  Many  of  them 
suggest  there  may  be  algorithmic  or  computational 
efficiencies  to  be  exploited. 

1.  Depending  on  how  one  enumerates  the  k  points, 
under  optimal  information  the  Gram  matrix  (the 
matrix  of  the  cubic  distances  which  one  solves  to 
obtain  the  coefficient  matrix)  appears  to  have  a 
recursive  block  Toeplitz  structure. 

2. Experimentally, this matrix appears to be well-
conditioned with respect to computing its inverse,
as fairly large systems (k = 100 x 100) show little
loss of precision: coefficients known to be
symmetric or equal retain their symmetry and
equality to about the limits of ordinary round-off
error.

3. The inverse of the Gram matrix is highly sparse;
when k = 100 x 100, there are only 621 distinct
entries. Further, the ratio of distinct entries to
total entries appears to decrease as k increases.

4.  Although  the  basis  splines  do  not  have  compact 
support,  they  appear  to  fall  off  rather  rapidly. 

7.  Related  Problems  in  Computer  Vision 

The theoretic results reported above (the optimality and
linearity of spline algorithms, the sufficiency of nonadaptive
information, and others) apply to other vision problems that
can be cast in the same fairly loose framework. The theory
requires that $F_1$ be an arbitrary linear space and that the
information be a vector of linear functionals on $F_1$. In
terms of vision, the linearity of $F_1$ is rarely a problem, since
object surfaces superimpose well; however, only some classes
of image features can be considered to be linear information.
The trouble is that linear information must also
superimpose: the features derived from the "sum" of two
surfaces must be the sum of the features independently
derived. This is often contrary to the laws of geometry,
physics, and optics; shading clearly does not sum well.
Nevertheless, there are several important types of linear
vision information, among which are:

1. The depth values themselves, at any place in the
image: $L_i = f(x_i, y_i)$. This is the problem just
analyzed. Note that there need not be any
restriction on the location of $(x_i, y_i)$; they can even
be chosen randomly.

2. The depth values derivable from a contour:
$L_i = f(x_i, y_i)$, where the values are restricted to lie
on a particular curve. This is the usual result of
the first stage of edge detection methods (zero-
crossing contours, etc.), or work on silhouettes.

3. A given directional derivative: $L_i$ = the
directional derivative of $f(x_i, y_i)$ with respect to $l_i$,
where $l_i$ is a direction vector in $R^2$. This would
be a one-dimensional variant of the shape-from-
shading problem, where the available information
samples are the uniquely determined surface
slopes in a particular direction. (A richer
formulation would acquire the two-dimensional
surface orientation by means of the gradient
vector. A richer and more practical formulation
would extend the definition of information to
capture constraints on surface normals, such as
those generated by shading.)

The linearity of information is important: it appears to be
a key determinant of many of the existing results of the
general theory. Inasmuch as many image observables are
non-linear functionals of the surface, these important cases
of non-linear problems remain to be thoroughly treated.
Work continues: recently, it has been proved that non-linear
continuous information is not more powerful than linear
continuous information [8]. Additionally, although most
results  deal  with  worst  case  models,  the  average  and 
asymptotic  cases  are  also  under  investigation:  see  [7],  page 
8a. 


8. Future Work

We see several areas of great interest. We plan to
investigate the effects of missing or errorful information.
The general theory is being pursued along those lines as
well, so some results may straightforwardly fall out.

More  practically,  perhaps  our  most  pressing  interest  is  in 
finding  an  efficient  algorithm  for  evaluating  the  basis  spline 
coefficients.  As  an  aid,  we  are  pursuing  the  idea  of 
displaying  the  matrix  as  an  image  itself  to  get  a  better 
understanding  of  its  structure.  Again,  since  an  exact 
solution  may  be  difficult,  we  are  also  exploring  various 
approximate  techniques,  particularly  with  regard  to  replacing 
the  basis  splines  with  ones  that  are  finitely  supported,  or 
with ones that are only asymptotically correctly shaped for
their  position. 


9. Summary

We believe that the information-centered approach to
algorithms can be applied to many vision problems with
powerful results. In this paper, we introduced the method
and have shown how results pertinent to depth map
interpolation are corollaries of the general theory. The
major results are that spline interpolations are provably
optimal in the worst case, that the resultant linear
algorithms are exceedingly simple and parallelizable given
some precomputation, and that adaption does not help. Our
hope is that the application of this approach to other vision
problems will provide similar insight and computational
power.


ABSTRACT

A new approach for the interpretation of optical flow
fields is presented. The flow field, which can be produced by
a sensor moving through an environment with several,
independently moving, rigid objects, is allowed to be sparse,
noisy and partially incorrect. The approach is based on
two main stages. In the first stage the flow field is
segmented into connected sets of flow vectors, where each set
is consistent with a rigid motion of a roughly planar
surface. In the second stage sets of segments are hypothesized
to be induced by the same rigidly moving object. Each of
these hypotheses is tested by searching for 3-D motion
parameters which are compatible with all the segments in the
corresponding set. Once the motion parameters are
recovered, the relative environmental depth can be estimated as
well. Experiments based on real and simulated data are
presented.


1.  INTRODUCTION 

Dynamic visual information can be produced by a
sensor moving through the environment and/or by
independently moving objects in the visual field. The
interpretation of such information consists of dynamic segmentation,
recovering the motion parameters of the sensor and each
moving object, and structure determination. The results of
this interpretation can be used to control behaviour, as in
robotics or navigation. They can also be integrated, as an
additional knowledge source, into an image understanding
system, such as the VISIONS system [HAN78].

The most common approach for the analysis of visual
motion is based on two phases: computation of an optical
flow field and interpretation of this field. In the present
discussion, the term 'optical flow field' refers to both a
'velocity field', composed of vectors describing the instantaneous
velocity of image elements, and a 'displacement field',
composed of vectors representing the displacement of image
elements from one frame to the next. In the latter case we
will assume small values of motion parameters.

The second phase, i.e., the interpretation of the optical
flow field, is the main concern of this paper. A new scheme
is proposed, which allows motion of the camera as well as
rigid objects in the scene. Furthermore, the flow field is
allowed to be sparse, noisy and partially incorrect. The
information in only one flow field, as opposed to a time
sequence of such fields, is utilized.

Our  approach  is  based  on  two  main  stages.  In  the  first 
stage  the  flow  field  is  segmented  into  connected  sets  of  flow 
vectors,  where  each  set  is  consistent  with  a  rigid  motion  of  a 
roughly  planar  surface.  In  the  second  stage  sets  of  segments 
are  hypothesized  to  be  induced  by  the  same  rigidly  moving 
object.  Each  of  these  hypotheses  is  tested  by  searching 
for  3-D  motion  parameters  which  are  compatible  with  all 
the  segments  in  the  corresponding  set.  Once  the  motion 
parameters  are  recovered,  the  relative  environmental  depth 
can  be  estimated  as  well. 

In the next section, techniques existing in the literature
for visual motion interpretation are examined. The
mathematical formulation of the model and the task is presented
in section 3. In subsequent sections, algorithms for flow
field segmentation, estimation of motion parameters, and
structure determination are developed. Preliminary
experiments based on real and simulated data are described in
section 6.

2. LITERATURE REVIEW

In this section we review methods existing in the
literature for interpreting optical flow fields. We concentrate
on techniques which assume rigid motion and basically rely
on the information contained in one flow field. Two main
issues are emphasized:

a) Scene Complexity. Some researchers assume that the
scene contains only one object, or, equivalently, that the
sensor is moving but the environment is stationary (e.g.,
[BRU81], [LAW82], [TSA84]). Others allow the scene to
contain several independently moving objects (e.g., [ULL79],
[NEU80]).

b) Robustness. Optical flow fields produced from real
images by existing techniques are noisy and partially incorrect
(see the discussion in [ULL81]). Many of the algorithms
described in the literature for interpretation of flow fields fail
under such conditions. Other algorithms are less sensitive
and work reasonably well on real world images.

In the first class of techniques, discussed in this
review, only one rigid object (or camera motion) is assumed.
A few researchers [ROA80, PRA80, NAG81a,b, FAN83a,b]
present sets of nonlinear equations with motion parameters
as unknowns. Methods for solving such equations are
usually iterative and require initial guesses of the unknowns.
Sensitivity to noise is indicated by experiments reported in
[ROA80, PRA80, FAN83a,b].

Longuet-Higgins [LON81] and Tsai and Huang [TSA84]
develop techniques based on solving a set of linear
equations. Furthermore, conditions for the uniqueness of the
solutions are formulated. However, difficulties in the
presence of noise are still reported [TSA84].

Bruss and Horn [BRU81] employ a least squares
approach which minimizes some measure of the discrepancy
between the measured flow and that predicted from the
computed motion parameters. In the case of general rigid
motion this approach leads to a system of nonlinear
equations from which the motion parameters can be computed
numerically. This method is computationally more
complicated than the methods offered in [LON81] and [TSA84],
but seems to be more robust in the presence of noise.

Assuming a purely translational motion, all the flow
vectors are oriented towards or away from a single point in the
image plane. Determining this point, called the focus of
expansion (FOE), yields the direction of the translation. A
few techniques, reviewed below, are based on this
observation.

Early results based on real images are reported in
[WIL81]. However, only sensor motion restricted to
translation is allowed and the environment is assumed to contain
only planar surfaces at one of two given orientations. Thus,
the algorithm can be based on a search for the FOE and
the distances to the surfaces in the scene. Lawton [LAW82]
describes a robust algorithm which has been applied to real
world images from several different task domains. This
algorithm requires no restrictions on the shape of the
environment, but is still restricted to translation. It is based on
a global sampling of an error measure corresponding to the
potential positions of the FOE, followed by a local search
to determine the exact location of the minimum value.
Results for other restricted cases of motion are presented in
[LAW84].

Prazdny [PRA81] describes a method which relies on
decomposition of the velocity field into rotational and
translational components. For a hypothesized rotational
component, the FOE of the corresponding translational field
and a related error measure are computed. Thus, an error
function of the 3 rotation parameters is obtained and the
solution can be determined by minimizing this function.
Jerian and Jain [JER83] report on difficulties with applying
a similar approach to noisy data.

Rieger and Lawton [RIE83] develop a relatively robust
and simple procedure for computing the motion parameters,
based on the fact that the differences between optic flow
vectors near occlusion boundaries are oriented towards the
FOE of the translational field. However, the environment is
assumed to contain occlusion boundaries which endow the
flow field with strong discontinuities.

A number of methods presented in the literature
allow (at least in principle) unconstrained sensor motion and
independently moving objects in the environment. Ullman,
in his somewhat pioneering work [ULL78], examines small sets
of adjacent vectors. If there exists a unique rigid
interpretation consistent with all the vectors in a given set, then
this interpretation is assumed to be correct and the vectors
in the set are grouped together. This approach seems to be
very sensitive to noise because of its local nature.

Longuet-Higgins and Prazdny [LON80] and Waxman
and Ullman [WAX83] introduce equations for computing
the motion parameters and the local structure at a given
point in the environment from the flow field and its first
and second spatial derivatives at the corresponding point
in the image. If the scene consists of several objects in
relative motion, then a separate computation can be carried
out on each one. However, local estimates of the second
derivatives of the optic flow seem to be inaccurate in the
presence of noise, and no algorithm has been presented for
reliably computing such derivatives.

More global approaches are proposed in [NEU80]
and [BAL81b]. Neumann [NEU80] proposes an elegant
hypothesize-and-test scheme: for any rotation hypothesis,
the translation component may be decomposed such that
motion compatibility of many flow vectors can be easily
tested. This technique heavily relies on the assumption of
orthographic projection.

Ballard and Kimball [BAL81b] apply the generalized
Hough technique to the optical flow field and thus extract
the motion parameters. This is a global approach which
is relatively insensitive to noise. In principle, it can also
be used in scenes containing independently moving objects.
However, the depth information is assumed to be known,
thus making the task much easier.

This review demonstrates typical constraints and
weaknesses of algorithms reported in the literature. No
algorithm for interpretation of optical flow fields in scenes
containing several, independently moving, rigid objects, has
been shown to work with noisy, real world data, unless
severe constraints are assumed or additional information is
utilized.

3.  THE  MODEL  AND  THE  TASK  — 

A  MATHEMATICAL  FORMULATION 

3.1  Basic  Model  and  Equations 

In this section we present a notation for describing the
motion of a camera through an environment containing
independently moving objects. We also review the equations
describing the relation between the 3-D motion model and
the corresponding optical flow, assuming a perspective
projection. The equations are developed both for velocity fields
and displacement fields.

Let (X, Y, Z) represent a cartesian coordinate system
which is fixed with respect to the camera (see figure 3.1)
and let (x, y) represent a corresponding coordinate system
of a planar image. The focal length, from the nodal point
O to the image, is assumed to be known. It can be
normalized to 1, without loss of generality. Thus, the perspective
projection (x, y) on the image of a point (X, Y, Z) in the
environment is:

$x = X/Z, \quad y = Y/Z.$  (3.1a,b)


Figure 3.1 (redrawn from [LON80]): A coordinate
system (X, Y, Z) attached to the camera, and the
corresponding image coordinates (x, y). The image
position p is the perspective projection of the point
P in the environment. $T = (T_X, T_Y, T_Z)$ and
$\Omega = (\Omega_X, \Omega_Y, \Omega_Z)$ represent the relative translation and
rotation of a given object in the scene.

The motion, relative to the camera, of a rigid object in
the scene can be decomposed into two components:
translation $T = (T_X, T_Y, T_Z)$ and rotation $\Omega = (\Omega_X, \Omega_Y, \Omega_Z)$.
In the equations corresponding to velocity fields, these
symbols represent instantaneous spatial velocities, and, in the
equations corresponding to displacement fields, they
represent differences in position and orientation between two
time instances.

In the velocity-based scheme, if (X, Y, Z) are the
instantaneous camera coordinates of a point on the object,
then the corresponding projection (x, y) on the image moves
with a velocity $(\alpha, \beta)$, where [LON80]:

$\alpha = -\Omega_X xy + \Omega_Y(1 + x^2) - \Omega_Z y + (T_X - T_Z x)/Z$  (3.2a)

$\beta = -\Omega_X(1 + y^2) + \Omega_Y xy + \Omega_Z x + (T_Y - T_Z y)/Z.$  (3.2b)

Notice that $(\alpha, \beta)$ can be represented as the sum

$(\alpha, \beta) = (\alpha_R, \beta_R) + (\alpha_T, \beta_T),$  (3.3)

where $(\alpha_R, \beta_R)$ and $(\alpha_T, \beta_T)$ are, respectively, the
rotational and translational components of the velocity field:

$\alpha_R = -\Omega_X xy + \Omega_Y(1 + x^2) - \Omega_Z y, \quad \alpha_T = (T_X - T_Z x)/Z,$  (3.4a,b)

$\beta_R = -\Omega_X(1 + y^2) + \Omega_Y xy + \Omega_Z x, \quad \beta_T = (T_Y - T_Z y)/Z.$  (3.4c,d)
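
The decomposition in equations (3.3) and (3.4) can be checked numerically. The sketch below is illustrative only; the function names `flow_components` and `flow` are hypothetical, and the focal length is taken as 1 as in the text. It also shows two consequences of the equations: the rotational component does not depend on the depth Z, and for a pure translation with nonzero $T_Z$ the flow vanishes at the image point $(T_X/T_Z, T_Y/T_Z)$, which is the focus of expansion mentioned in the literature review.

```python
def flow_components(x, y, Z, T, W):
    """Rotational and translational parts of the image velocity,
    following Eqs. (3.4a-d). T = (TX, TY, TZ); W = (WX, WY, WZ) stands
    for the rotation vector Omega. Focal length is normalized to 1."""
    TX, TY, TZ = T
    WX, WY, WZ = W
    alpha_R = -WX * x * y + WY * (1.0 + x * x) - WZ * y
    beta_R = -WX * (1.0 + y * y) + WY * x * y + WZ * x
    alpha_T = (TX - TZ * x) / Z
    beta_T = (TY - TZ * y) / Z
    return (alpha_R, beta_R), (alpha_T, beta_T)


def flow(x, y, Z, T, W):
    """Full image velocity (alpha, beta) of Eqs. (3.2a,b) as the sum (3.3)."""
    (a_R, b_R), (a_T, b_T) = flow_components(x, y, Z, T, W)
    return a_R + a_T, b_R + b_T


if __name__ == "__main__":
    # Pure translation: the flow is zero at the focus of expansion.
    T = (1.0, 2.0, 4.0)
    print(flow(T[0] / T[2], T[1] / T[2], 5.0, T, (0.0, 0.0, 0.0)))
```

Since only the translational part carries the factor 1/Z, depth can be recovered from the flow only up to the overall scale of T, which is why the text speaks of relative depth.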

In the displacement-based scheme, let (X, Y, Z) be the
camera coordinates at time $t_1$ of a point on the object and
let (X', Y', Z') be the corresponding coordinates at time
$t_2$. Then:

$(X', Y', Z')^\top = R\,(X, Y, Z)^\top + T,$  (3.5)

where the rotation matrix R can be approximated,
assuming small values of the rotation parameters, by:

$R \approx \begin{pmatrix} 1 & -\Omega_Z & \Omega_Y \\ \Omega_Z & 1 & -\Omega_X \\ -\Omega_Y & \Omega_X & 1 \end{pmatrix}.$  (3.6)

If (x, y) and (x', y') are the image coordinates
corresponding to the points (X, Y, Z) and (X', Y', Z'), respectively,
then:

$x' = \frac{X'}{Z'} = \frac{x - \Omega_Z y + \Omega_Y + T_X/Z}{-\Omega_Y x + \Omega_X y + 1 + T_Z/Z},$  (3.7a)

$y' = \frac{Y'}{Z'} = \frac{\Omega_Z x + y - \Omega_X + T_Y/Z}{-\Omega_Y x + \Omega_X y + 1 + T_Z/Z}.$  (3.7b)

Now, let $(\alpha, \beta)$ be, in this case, the displacement vector
(x' - x, y' - y). Then from (3.7) we get:

$\alpha = \frac{-\Omega_X xy + \Omega_Y(1 + x^2) - \Omega_Z y + (T_X - T_Z x)/Z}{1 + \Omega_X y - \Omega_Y x + T_Z/Z},$  (3.8a)

$\beta = \frac{-\Omega_X(1 + y^2) + \Omega_Y xy + \Omega_Z x + (T_Y - T_Z y)/Z}{1 + \Omega_X y - \Omega_Y x + T_Z/Z}.$  (3.8b)

If $|T_Z/Z| \ll 1$ and the field of view of the camera, i.e.,
the visual angle corresponding to the whole image, is not
very large, then (employing also the assumption that the
rotation parameters are small) the denominator in (3.8) is
close to 1 and we can approximate the displacement
vector $(\alpha, \beta)$ by equations (3.2).

To  conclude:  equations  (3.2)  hold  not  only  for  veloc¬ 
ity  fields,  but  also  for  displacement  fields,  given  that  the 
rotation parameters are small and that the Z-component
of the translation is small relative to the distance of the
object from the image plane. Such assumptions are reasonable
if the time interval between the two image frames
is short enough or if the motion is slow. In the following
sections  we  restrict  ourselves  to  conditions  which  allow  us 
to  employ  equations  (3.2)  as  the  basis  of  our  analysis. 
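As a rough numerical check of this approximation, the sketch below compares the exact displacement of (3.8) with the velocity-style expression (3.2). The function names and the sample motion values are assumptions made only for this illustration.

```python
def approx_displacement(x, y, depth, t, omega):
    """Displacement predicted by the small-motion equations (3.2)."""
    tx, ty, tz = t
    ox, oy, oz = omega
    alpha = -ox * x * y + oy * (1 + x * x) - oz * y + (tx - tz * x) / depth
    beta = -ox * (1 + y * y) + oy * x * y + oz * x + (ty - tz * y) / depth
    return alpha, beta


def displacement_denominator(x, y, depth, t, omega):
    """Common denominator of equations (3.7) and (3.8)."""
    ox, oy, _ = omega
    return 1 + ox * y - oy * x + t[2] / depth


def exact_displacement(x, y, depth, t, omega):
    """Displacement (x' - x, y' - y) computed from equations (3.7)."""
    tx, ty, tz = t
    ox, oy, oz = omega
    denom = displacement_denominator(x, y, depth, t, omega)
    x_new = (x - oz * y + oy + tx / depth) / denom
    y_new = (oz * x + y - ox + ty / depth) / denom
    return x_new - x, y_new - y
```

For a slow motion such as `t = (0.1, 0.05, 0.02)`, `omega = (0.01, -0.02, 0.015)` and `depth = 10`, the denominator stays within about half a percent of 1, so the two displacement estimates differ only marginally.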

3.2  The  Task  —  Inputs  and  Outputs 

The input utilized by our scheme for interpreting motion
information is a flow field described by
$\{(\alpha(x, y), \beta(x, y), W(x, y))\}$, where $(\alpha(x, y), \beta(x, y))$ is the
flow vector at the $(x, y)$ pixel in the image and $W(x, y)$ is a
corresponding weight between 0 and 1. High reliability of the flow vector
is represented by a weight close to 1 and low reliability by
a weight close to 0. The flow field can be either dense, thus
defined at most of the pixels, or sparse, thus defined only
on a sparse subset of the image pixels. If the flow field is
undefined at a pixel $(x, y)$, then $W(x, y)$ is set to
0. A rough estimate of the noise level in the flow field is
assumed to be known.

The  interpretation  process  should  result  in  three  out¬ 
puts:  object  masks,  motion  parameters  and  depth.  We 
want to partition the set $\{(x, y) : W(x, y) > 0\}$ into disjoint
sets of pixels, where each set corresponds to a different
rigid  object.  The  pixels  corresponding  to  the  stationary  en¬ 
vironment,  where  the  optical  flow  is  induced  only  by  the 
camera  motion,  should  be  grouped  together. 

The 5 recoverable motion parameters of each rigid object,
relative to the camera, should be estimated. These
parameters include the rotation parameters $(\Omega_X, \Omega_Y, \Omega_Z)$
and the direction of the translation vector, defined by the
unit vector $U = T/r$, where $r$ is the length of the translation
vector $T$. Once the motion parameters are recovered,
it is also possible to estimate the relative depth, $Z(x, y)/r$,
corresponding to each pixel $(x, y)$ where a flow vector is defined,
unless $r = 0$ or the location of the vector is exactly
at the FOE.


4.  SEGMENTATION 

In  this  section  we  develop  a  method  for  segmentation 
of the flow field into connected sets of flow vectors, where
each  set  is  consistent  with  a  rigid  motion  of  a  roughly  planar 
patch.  A  segment,  satisfying  this  constraint,  is  very  likely 
to  correspond  to  a  portion  of  only  one  rigid  object.  Thus, 
the data is organized into coherent units which form the
basis  for  further  processing.  Another  purpose  of  the  seg¬ 
mentation  is  exclusion  of  incorrect  flow  vectors  which  are 
inconsistent  with  their  neighbors. 

4.1 Ψ Transformations — A Segmentation Constraint

In  order  to  achieve  a  useful  segmentation,  we  employ 
a  few  simple  observations  on  the  structure  of  optical  flow 
fields.  First,  we  examine  the  flow  field  induced  by  a  rigid 
motion  of  a  planar  surface.  Excluding  the  degenerate  case 
in  which  the  same  plane  contains  both  the  surface  and  the 


nodal  point  (and,  therefore,  the  corresponding  region  in  the 
image  is  a  straight  line),  the  surface  can  be  represented  by 
the  equation 

$$k_1 X + k_2 Y + k_3 Z = 1. \tag{4.1}$$

The coefficients $k_1$, $k_2$ and $k_3$ can be any real numbers,
except the case in which all of them are zero. Using (3.1),
we obtain:

$$1/Z = k_1 x + k_2 y + k_3. \tag{4.2}$$

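Substituting (4.2) in (3.2), we realize that, given a relative
motion $(T, \Omega)$, the flow field is:

$$\alpha = a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 x y, \tag{4.3a}$$

$$\beta = a_4 + a_5 x + a_6 y + a_7 x y + a_8 y^2, \tag{4.3b}$$

where

$$a_1 = \Omega_Y + k_3 T_X, \tag{4.4a}$$

$$a_2 = k_1 T_X - k_3 T_Z, \tag{4.4b}$$

$$a_3 = -\Omega_Z + k_2 T_X, \tag{4.4c}$$

$$a_4 = -\Omega_X + k_3 T_Y, \tag{4.4d}$$

$$a_5 = \Omega_Z + k_1 T_Y, \tag{4.4e}$$

$$a_6 = k_2 T_Y - k_3 T_Z, \tag{4.4f}$$

$$a_7 = \Omega_Y - k_1 T_Z, \tag{4.4g}$$

$$a_8 = -\Omega_X - k_2 T_Z. \tag{4.4h}$$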

Note that similar results are introduced in [WAX83]. Equations
(4.3) represent what we shall call a Ψ transformation.
This is a 2-D transformation of the image into itself based
on the 8 parameters $a_1, \ldots, a_8$.
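The relation between a moving plane and its Ψ transformation can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the coefficient formulas are those obtained by substituting $1/Z = k_1 x + k_2 y + k_3$ into (3.2).

```python
def psi_coefficients(k, t, omega):
    """Eight Psi parameters (a1..a8) for a plane k and a motion (t, omega)."""
    k1, k2, k3 = k
    tx, ty, tz = t
    ox, oy, oz = omega
    return (
        oy + k3 * tx,        # a1
        k1 * tx - k3 * tz,   # a2
        -oz + k2 * tx,       # a3
        -ox + k3 * ty,       # a4
        oz + k1 * ty,        # a5
        k2 * ty - k3 * tz,   # a6
        oy - k1 * tz,        # a7
        -ox - k2 * tz,       # a8
    )


def psi_flow(a, x, y):
    """Flow vector predicted by the Psi transformation, equations (4.3)."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    alpha = a1 + a2 * x + a3 * y + a7 * x * x + a8 * x * y
    beta = a4 + a5 * x + a6 * y + a7 * x * y + a8 * y * y
    return alpha, beta


def rigid_flow(x, y, inv_depth, t, omega):
    """Flow vector of equations (3.2), written in terms of 1/Z."""
    tx, ty, tz = t
    ox, oy, oz = omega
    alpha = -ox * x * y + oy * (1 + x * x) - oz * y + (tx - tz * x) * inv_depth
    beta = -ox * (1 + y * y) + oy * x * y + oz * x + (ty - tz * y) * inv_depth
    return alpha, beta
```

Evaluating `psi_flow` and `rigid_flow` at the same pixel, with `inv_depth` taken from the plane equation (4.2), should give the same vector up to rounding.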

We  proceed  now  with  another  observation,  related  to 
arbitrary  surfaces  in  the  environment.  Given  such  a  surface, 
it  can  be  described  as  a  function  Z  =  Z(x,  y)  defined  on 
the  image  region  R  which  corresponds  to  the  projection  of 
this  surface.  Let  Z'  =  Z'(x,  y)  be  an  approximation  to  the 
surface  Z  such  that 

$$|\Delta Z(x, y)| \triangleq |Z(x, y) - Z'(x, y)| \ll Z(x, y) \quad \forall (x, y) \in R. \tag{4.5}$$

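If $(\alpha_T, \beta_T)$ and $(\alpha'_T, \beta'_T)$ are the translational components
of the flow fields induced by the same motion of the surfaces
$Z$ and $Z'$, respectively, then

$$\alpha'_T = \frac{T_X - T_Z x}{Z'} = \frac{T_X - T_Z x}{Z - \Delta Z} \approx \frac{T_X - T_Z x}{Z}\,(1 + \Delta Z/Z) = \alpha_T (1 + \Delta Z/Z), \tag{4.6a}$$

$$\beta'_T = \frac{T_Y - T_Z y}{Z'} \approx \beta_T (1 + \Delta Z/Z). \tag{4.6b}$$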
The rotational component of the flow field is independent
of the structure of the environment. Hence, given (4.5),
the flow field induced by the approximating surface $Z'$ is
very similar to the real flow in the region $R$. As a conclusion,
if $Z'$ is a planar surface which satisfies equation
(4.5), then the flow field in $R$ can be approximated by a Ψ
transformation.

In  a  real  world  environment  the  surface  can  be  usually 
approximated  by  a  piecewise  planar  surface,  containing  only 
a  few  planar  patches,  where  the  distance  between  the  real 
surface  and  the  approximating  one  is  small  relative  to  the 
distance  from  the  sensor  to  the  surface.  If  this  is  the  case, 
then  the  flow  field  can  be  approximated,  reasonably  well, 
by a piecewise Ψ transformation. This suggests that a
useful segmentation of the flow field can be based on finding
connected sets of flow vectors, where each set approximately
satisfies the same Ψ transformation. Thus, each segment
is  consistent  with  a  rigid  motion  of  a  roughly  planar  surface 
and  can  be  assumed  to  be  induced  by  the  relative  motion 
of  only  one  rigid  object.  In  the  next  section  we  describe  an 
algorithm  for  achieving  such  a  segmentation. 

4.2  Segmentation  Algorithm 

The  generalized  Hough  transform  technique  [BAL81a] 
is a useful tool for grouping together flow vectors which satisfy
the same 2-D parameterized transformation [ADI83].
In  this  technique,  the  set  of  relevant  transformations  is  rep¬ 
resented  by  a  discrete  multi-dimensional  parameter  space, 
where  each  dimension  corresponds  to  one  of  the  transfor¬ 
mation  parameters.  Each  point  in  this  space  uniquely  char¬ 
acterizes  a  transformation,  defined  by  the  corresponding 
parameter  values.  A  flow  vector  ‘votes’  for  each  point  with 
an  associated  transformation  consistent  with  this  vector. 
The  points  receiving  the  most  votes  are  likely  to  represent 
transformations  corresponding  to  large  segments  in  the  flow 
field. 

As  a  global  technique,  the  Hough  transform  is  rel¬ 
atively  insensitive  to  noise  and  partially  incorrect  or  oc¬ 
cluded  data.  However,  high  dimensionality  of  the  parameter 
space  requires  large  amounts  of  memory  and  computation 
time.  In  our  case,  the  segmentation  constraint  is  based  on 
the 8-parameter Ψ transformations (equations (4.3)). The
Hough  technique  can,  in  principle,  be  employed,  but  the 
computational  cost  required  for  such  a  number  of  param¬ 
eters  is  very  high.  Therefore,  a  three-stage  algorithm  is 
proposed. 

The  first  stage  is  based  on  grouping  together  adjacent 
flow  vectors  into  components  consistent  with  affine  trans¬ 
formations.  The  affine  transformations,  represented  by 


$$\alpha = a_1 + a_2 x + a_3 y \tag{4.7a}$$

$$\beta = a_4 + a_5 x + a_6 y, \tag{4.7b}$$

are a subclass of the Ψ transformations, parameterized by
only  6  parameters.  Furthermore,  these  parameters  can  be 
partitioned  into  two  disjoint  sets  of  3  parameters  each,  cor¬ 
responding  to  equations  (4.7a)  and  (4.7b).  Thus,  the  group¬ 
ing  problem  in  the  first  stage  can  be  basically  solved  by 
applying  the  Hough  technique  to  3-dimensional  parameter 
spaces,  as  will  be  shown  in  sub-section  4.2.1. 

In  the  second  stage,  components  which  are  consistent 
with the same Ψ transformation are merged into segments.
Given  a  set  of  adjacent  components,  optimal  parameters  are 
computed  using  the  least-squares  technique.  Related  error 
measures,  associated  with  each  component  in  the  set,  can 
be  then  obtained.  If  these  error  values  are  not  high  (in  a 
sense  defined  in  sub-section  4.2.2),  then  the  components  are 
merged. 

Sometimes  over-fragmentation  may  occur  in  the  first 
stage  of  the  segmentation,  that  is,  a  segment  is  partitioned 
into  a  large  number  of  small  components,  as  demonstrated 
in  experiment  1  in  section  6.1  (see  figure  6.1b).  In  order 
to  reduce  the  computational  cost  of  the  first  and  second 
segmentation  stages,  the  grouping  of  vectors  belonging  to 
small  connected  sets  may  be  postponed,  in  such  a  case,  to 
the  third  stage.  In  this  stage,  flow  vectors  which  are  not 
contained  in  any  of  the  segments  are  merged  into  neighbor¬ 
ing  segments,  if  they  are  consistent  with  the  corresponding 
Ψ transformations. If, after the third stage, some of these
small  sets  are  still  not  merged  into  the  existing  segments, 
then  the  first  and  second  stages  of  the  segmentation  may 
be  repeated,  focused  only  on  these  sets,  thus  possibly  cre¬ 
ating  new  segments.  In  the  following  sub-sections  the  three 
stages  of  the  segmentation  are  more  fully  described,  but, 
for  the  sake  of  brevity,  many  details  are  still  suppressed. 

4.2.1 First Stage — Grouping Based on Affine Transformations

4.2.1.1 A modified version of the generalized Hough technique

The  grouping  of  flow  vectors  into  components  consis¬ 
tent  with  affine  transformations  is  based  on  a  modification 
of  the  generalized  Hough  technique.  The  affine  transfor¬ 
mations  can  be  represented  by  a  6-dimensional  parameter 
space where each dimension corresponds to one of the parameters
$a_1, \ldots, a_6$ in equations (4.7). For computational
reasons the parameter space should contain only a finite
number of points. Therefore, minimal and maximal values
are determined for each parameter and the corresponding
interval is quantized. The parameter space is the Cartesian
product  of  the  obtained  sets. 

A flow vector $(\alpha(x, y), \beta(x, y))$ votes for a transformation
$(a_1, \ldots, a_6)$ if it approximately satisfies the constraint
equations (4.7), that is, if

$$\delta \triangleq \sqrt{\delta_\alpha^2 + \delta_\beta^2} \le \epsilon, \tag{4.8a}$$

where

$$\delta_\alpha = |\alpha - a_1 - a_2 x - a_3 y| \tag{4.8b}$$

and

$$\delta_\beta = |\beta - a_4 - a_5 x - a_6 y|. \tag{4.8c}$$

Note that $\epsilon$ is a function of the resolution in the parameter
space and the noise level in the flow field, but it is never
less than a given threshold, typically one pixel. In this case,
the amount of support is determined by the function

$$V(a_1, a_2, a_3, a_4, a_5, a_6, x, y) = 1 - 0.75\,\delta/\epsilon, \tag{4.9}$$

which allows the support to range from 1 down to 0.25 for
those flow vectors at the limit of the acceptable error range.
The total amount of support, given to each transformation
$(a_1, \ldots, a_6)$, is the weighted sum

$$S(a_1, a_2, a_3, a_4, a_5, a_6) = \sum_{x, y} W(x, y)\, V(a_1, a_2, a_3, a_4, a_5, a_6, x, y), \tag{4.10}$$

where $W(x, y)$ is the weight of the flow vector in the pixel
$(x, y)$ and the sum runs over the flow vectors which satisfy (4.8a).
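The voting rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the `(x, y, alpha, beta, w)` tuple layout are assumptions, and a vector that fails the test (4.8a) is assumed to contribute no support.

```python
import math


def vote(alpha, beta, x, y, params, eps):
    """Support V of equation (4.9) given by one flow vector, or 0 if rejected."""
    a1, a2, a3, a4, a5, a6 = params
    d_alpha = abs(alpha - a1 - a2 * x - a3 * y)   # equation (4.8b)
    d_beta = abs(beta - a4 - a5 * x - a6 * y)     # equation (4.8c)
    delta = math.hypot(d_alpha, d_beta)           # equation (4.8a)
    if delta > eps:
        return 0.0
    return 1.0 - 0.75 * delta / eps


def total_support(flow_field, params, eps):
    """Weighted sum of votes, equation (4.10).

    flow_field is a list of (x, y, alpha, beta, w) tuples.
    """
    return sum(w * vote(alpha, beta, x, y, params, eps)
               for x, y, alpha, beta, w in flow_field)
```

A vector that fits the transformation exactly contributes its full weight, while one at the edge of the tolerance contributes a quarter of it.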

Suppose  now  that  we  want  to  find  the  affine  transfor¬ 
mation,  among  those  represented  in  the  parameter  space, 
which  is  maximally  supported  by  a  given  set  of  flow  vec¬ 
tors.  Basically,  we  have  to  compute  the  support,  according 
to  equation  (4.10),  given  to  any  of  those  transformations.  A 
serious  computational  problem  may  arise  if  the  number  of 
points  in  the  parameter  space  is  very  high.  If,  for  example, 
the  minimal  and  maximal  possible  values  of  the  parameter 
$a_1$ are -64 pixels and +64 pixels, respectively, and the desired
accuracy is 0.25 pixel, then 512 samples are apparently
needed for this parameter. If an equal number of samples is
also required for the other parameters, then the parameter
space should contain $512^6 \approx 1.8 \times 10^{16}$ points. In such a
case,  a  straightforward  Hough  technique  is  computationally 
impractical. 

This  problem  is  alleviated  by  using  two  techniques. 
First,  a  multi-resolution  scheme  in  the  parameter  space  is 
employed.  The  Hough  technique  is  iteratively  used,  where 
in each iteration the parameter space is quantized around
the values estimated in the previous iteration, using a finer
resolution. Thus, utilizing a limited memory size, accurate
parameter values can still be found. Other methods for
achieving this goal are presented in [ORO81, SLO81].

The  second  technique  is  based  on  decomposition  of  the 
parameter set into two disjoint subsets, $\{a_1, a_2, a_3\}$ and
$\{a_4, a_5, a_6\}$. The Hough technique is separately applied
to the corresponding 3-dimensional parameter spaces, using
the relevant constraint, (4.7a) or (4.7b). Sets of highly
supported parameter triples, $A_\alpha = \{(a_{1i}, a_{2i}, a_{3i}) : i = 1, \ldots, N\}$
and $A_\beta = \{(a_{4j}, a_{5j}, a_{6j}) : j = 1, \ldots, N\}$, are
thus found, where $N$ was experimentally determined to be
10. As a result, a set of $N^2$ hypothesized affine transformations,

$$A_{\alpha\beta} = A_\alpha \times A_\beta = \{(a_{1i}, a_{2i}, a_{3i}, a_{4j}, a_{5j}, a_{6j}) : i, j = 1, \ldots, N\}, \tag{4.11}$$

is obtained. The support function can then be directly
applied to the set $A_{\alpha\beta}$, thus determining the maximally
supported transformation $T^*$ in this set. $T^*$ is not
necessarily the maximally supported transformation in the 6-
dimensional  parameter  space.  However,  large  components 
in  the  flow  field,  corresponding  to  maxima  points  in  the 
6-dimensional  space,  can  be  expected  to  produce  maxima 
points  also  in  each  of  the  3-dimensional  parameter  spaces. 
Therefore $T^*$ is, at least, a near-optimal transformation, as
can  also  be  concluded  from  the  experimental  results.  The 
decomposition  technique  is  employed  in  each  iteration  of 
the  multi-resolution  scheme;  together  they  create  a  very 
efficient  algorithm. 
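The decomposition idea can be sketched as follows. This is a simplified, single-resolution illustration under stated assumptions: one shared list of sample values (`grid`) is used for every parameter, the 3-dimensional accumulators count weights of vectors within `eps` of the constraint, and all names are hypothetical. The multi-resolution refinement is omitted.

```python
import itertools
import math


def top_triples(samples, grid, eps, n_best):
    """Best-supported (a, b, c) for value = a + b*x + c*y in a 3-D space.

    samples is a list of (x, y, value, weight) tuples.
    """
    scored = []
    for a, b, c in itertools.product(grid, repeat=3):
        score = sum(w for x, y, v, w in samples
                    if abs(v - a - b * x - c * y) <= eps)
        scored.append((score, (a, b, c)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [triple for _, triple in scored[:n_best]]


def support(field, params, eps):
    """Total support of a 6-parameter affine transformation."""
    a1, a2, a3, a4, a5, a6 = params
    total = 0.0
    for x, y, alpha, beta, w in field:
        delta = math.hypot(alpha - a1 - a2 * x - a3 * y,
                           beta - a4 - a5 * x - a6 * y)
        if delta <= eps:
            total += w * (1.0 - 0.75 * delta / eps)
    return total


def best_affine(field, grid, eps, n_best=10):
    """Search two 3-D spaces, then score the n_best**2 combined hypotheses."""
    alpha_samples = [(x, y, alpha, w) for x, y, alpha, _, w in field]
    beta_samples = [(x, y, beta, w) for x, y, _, beta, w in field]
    cands_alpha = top_triples(alpha_samples, grid, eps, n_best)
    cands_beta = top_triples(beta_samples, grid, eps, n_best)
    return max((ca + cb for ca in cands_alpha for cb in cands_beta),
               key=lambda params: support(field, params, eps))
```

With `q` samples per parameter, this evaluates `2 * q**3` accumulator cells plus `n_best**2` combined hypotheses instead of `q**6` cells.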

4.2.1.2 Implementation of a multipass approach

The  components  which  we  try  to  locate  are  connected 
sets  of  flow  vectors  which  support  the  same  affine  transfor¬ 
mation.  The  algorithm  for  obtaining  this  goal  is  based  on 
a multipass Hough approach, where a basic cycle of operations
is repeatedly executed [FEN79, ADI83]. The input
to  each  cycle  includes  masks  of  the  components  which  were 
already  detected  during  the  previous  cycles  and  a  mask  of 
those  vectors  which  were  excluded  from  further  considera¬ 
tion.  The  cycle  is  composed  of  the  following  steps: 

1)  Consider  the  set  of  pixels  which  are  assigned  a  pos¬ 
itive  weight,  do  not  belong  to  any  of  the  previously  found 
components  and  were  not  excluded  from  further  considera¬ 
tion.  Find  in  this  set  a  connected  subset  E  with  maximal 
sum  of  weights.  If  this  sum  is  below  a  given  threshold  L , 
which  is  related  to  the  noise  level  in  the  flow  field,  then  stop 
searching  for  new  components  and  start  the  merging  stage. 
Sometimes  over-fragmentation  occurs,  i.e.,  a  segment  is  par¬ 
titioned  into  a  large  number  of  small  components  .  In  order 
to  prevent  an  excessive  number  of  cycles  in  such  a  case,  a 
new  threshold,  higher  than  L,  is  determined  and  the  pro¬ 
cess  is  stopped  if  the  sum  of  weights  is  below  this  threshold. 
The  grouping  of  vectors  in  small  sets  is  thus  postponed  to 
the  third  stage. 

2)  Partition  the  set  E  into  a  given  number  (typically 
64)  of  square  windows,  such  that  the  sum  of  weights  in  each 
window  is  roughly  the  same.  Then,  from  each  window, 
select  the  flow  vector  with  maximal  weight.  The  Hough 
technique  will  be  applied  only  to  these  vectors,  and  not  to 
the  whole  set  E ,  in  order  to  reduce  the  computation  time. 

3) Use the modified Hough technique, described in section
4.2.1.1, to find the affine transformation which receives
the  maximal  support  from  the  flow  vectors  selected  in  the 
previous  step. 

4)  Determine  the  set  F  of  all  the  vectors  in  E  which 
are  consistent  with  the  computed  affine  transformation.  If 
the sum of weights corresponding to F is below the threshold
L, then exclude the set E from further consideration and
start a new cycle. Otherwise, find in F a connected subset
G  with  maximal  sum  of  weights.  Then,  if  this  sum  exceeds 
L ,  add  G  to  the  list  of  components;  otherwise,  just  exclude 
G  from  further  consideration  (to  prevent  an  infinite  loop). 
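Steps 1 and 4 both need a connected subset with maximal sum of weights. A minimal sketch of that search is given below; it assumes 4-connectivity between pixels and a dictionary mapping pixel coordinates to weights, neither of which is specified in the text, and the function name is hypothetical.

```python
from collections import deque


def max_weight_connected_subset(weights, excluded=frozenset()):
    """Return (pixels, weight_sum) of the heaviest connected subset.

    weights maps (x, y) to W(x, y); pixels with zero weight or listed in
    excluded (already grouped or discarded) are ignored.
    """
    candidates = {p for p, w in weights.items() if w > 0 and p not in excluded}
    seen = set()
    best, best_sum = set(), 0.0
    for start in candidates:
        if start in seen:
            continue
        # Breadth-first flood fill over 4-connected neighbors.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            x, y = queue.popleft()
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb in candidates and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        total = sum(weights[p] for p in component)
        if total > best_sum:
            best, best_sum = component, total
    return best, best_sum
```

In a cycle, the returned sum would be compared with the threshold L to decide whether to continue searching for components.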

4.2.2  Second  Stage  —  Merging  of  Components 

Components, created in the first stage of the segmentation,
are atomic units which, if consistent with the same
Ψ transformation, should be merged together to create a
segment. Two main steps can be observed in the merging
process. In the first step an optimal Ψ transformation is
estimated for each component, employing the least-squares
technique. If the component contains $n$ flow vectors, then
the error function to be minimized is

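$$E(a_1, \ldots, a_8) = \sum_{i=1}^{n} W_i \left[ (\alpha_i - a_1 - a_2 x_i - a_3 y_i - a_7 x_i^2 - a_8 x_i y_i)^2 + (\beta_i - a_4 - a_5 x_i - a_6 y_i - a_7 x_i y_i - a_8 y_i^2)^2 \right], \tag{4.12}$$

where, for each $1 \le i \le n$, $(\alpha_i, \beta_i) = (\alpha(x_i, y_i), \beta(x_i, y_i))$ is
a flow vector and $W_i$ is the corresponding weight. Taking
partial derivatives with respect to $a_1, \ldots, a_8$ and equating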
to  0,  a  set  of  linear  equations  is  obtained.  Their  solution, 
$a_1^*, \ldots, a_8^*$, is the optimal Ψ transformation. Substituting
this solution in (4.12) and using the normalization equation

$$\sigma = \sqrt{E(a_1^*, \ldots, a_8^*) \Big/ \sum_{i=1}^{n} W_i}, \tag{4.13}$$

an error value, corresponding to the component, is obtained.
$\sigma$ is an estimate of the standard deviation of the
actual flow values from those predicted by the optimal Ψ
transformation.
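The least-squares fit of a Ψ transformation reduces to an 8 x 8 linear system. The sketch below builds and solves the normal equations in plain Python. It is an illustration under assumptions: the names are hypothetical, and the error value is normalized by the sum of weights, by analogy with the normalization used for the motion-parameter error later in the paper.

```python
import math


def psi_rows(x, y):
    """Coefficient rows of a1..a8 for the alpha and beta equations (4.3)."""
    return ([1, x, y, 0, 0, 0, x * x, x * y],
            [0, 0, 0, 1, x, y, x * y, y * y])


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def fit_psi(vectors):
    """Weighted least-squares Psi fit; vectors holds (x, y, alpha, beta, w).

    Returns (a, sigma): the 8 parameters and the normalized error value.
    """
    ata = [[0.0] * 8 for _ in range(8)]
    atb = [0.0] * 8
    for x, y, alpha, beta, w in vectors:
        for row, value in zip(psi_rows(x, y), (alpha, beta)):
            for i in range(8):
                atb[i] += w * row[i] * value
                for j in range(8):
                    ata[i][j] += w * row[i] * row[j]
    a = solve(ata, atb)
    error = 0.0
    for x, y, alpha, beta, w in vectors:
        for row, value in zip(psi_rows(x, y), (alpha, beta)):
            predicted = sum(r * p for r, p in zip(row, a))
            error += w * (value - predicted) ** 2
    return a, math.sqrt(error / sum(v[4] for v in vectors))
```

The `ValueError` branch corresponds to the degenerate case, noted later in the text, in which the linear equations are linearly dependent and no error value can be computed for the component.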

In the second step Ψ transformations, corresponding
to merged sets of adjacent components, are computed.
Based on related error values, associated with each component
in such a set, it is decided whether to merge the components.
In order to formulate the conditions for a merging
decision, some notation is used. First, $S$ denotes a set of
adjacent components and $\Psi_S$ is the corresponding optimal
Ψ transformation. For each component $C_j$ in $S$, $\sigma_j$ is the
error value found in the first step of the merging process, $\sigma'_j$
is the error value obtained by substituting the coefficients
of $\Psi_S$ in (4.13), and $p_j$ is the ratio between the sum of
vector weights in $C_j$ and the total sum of weights in the
set $S$.

The ratios $\{\sigma'_j/\sigma_j\}$ are a major factor in the merging
decision. If these ratios are only slightly higher than 1, then
a merging decision seems to be justified. Note that $\sigma'_j$ is
never less than $\sigma_j$, because the optimal Ψ transformation
corresponding to the component $C_j$ can be adjusted to the
local surface and noise associated with $C_j$. If, however,
$p_j$ is close to 1, then we expect $\sigma'_j$ to be very close to $\sigma_j$,
especially when a merging decision is justified. Hence, the
allowed level of $\sigma'_j/\sigma_j$ will be defined to be a monotonically
decreasing function of $p_j$:

$$L_a(p_j) = \tau_1 - (\tau_1 - 1)\,p_j, \tag{4.14}$$

where $\tau_1$ is a given threshold (typically $\tau_1 = 1.5$). Thus,
$L_a(p_j)$ ranges from values close to 1 for components with
relatively large weight, up to almost $\tau_1$ for small components.

Sometimes, however, $\sigma_j$ cannot be computed because
the linear equations derived from (4.12) are linearly dependent.
In addition, if the component $C_j$ is small, then $\sigma_j$
may be unreliable as a basis for evaluation of $\sigma'_j$ values.
Therefore, an absolute threshold $L_t(p_j)$ of allowed values
of $\sigma'_j$ is also employed. $L_t(p_j)$ is given by

$$L_t(p_j) = \tau_3 - (\tau_3 - \tau_2)\,p_j, \tag{4.15}$$

where $\tau_2$ and $\tau_3$ are pre-determined thresholds related to
the expected noise level in the flow field and $\tau_3 > \tau_2$. Thus
$L_t(p_j)$ ranges from $\tau_2$ for very large components up to $\tau_3$
for small components. The reason for this dependency on
$p_j$ is related to the effect of statistical averaging of the noise.
In large objects, such averaging is likely to take place and
thus $\tau_2$ represents the estimated standard deviation of the
noise. The threshold $\tau_3$, on the other hand, represents some
reasonable upper bound of the noise level. If, for example,
the most significant noise is induced by using flow values
rounded to integers and, therefore, the noise is uniformly
distributed between -0.5 pixels and +0.5 pixels, then $\tau_2$
will be taken to be the corresponding standard deviation,
that is, approximately 0.3 pixels, and $\tau_3$ will be 0.5 pixels.
To conclude, a merging decision is accepted if and only if,
for each component $C_j$ in $S$, $\sigma'_j/\sigma_j < L_a(p_j)$ or
$\sigma'_j < L_t(p_j)$, i.e.,

$$\sigma'_j < \max\{L_a(p_j)\,\sigma_j,\; L_t(p_j)\}. \tag{4.16}$$
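The merging test can be written compactly. The following sketch is illustrative only: the function names are hypothetical, the default thresholds are the typical values quoted above (1.5, 0.3 and 0.5), and a component whose own error value could not be computed is represented by `None`, in which case only the absolute threshold applies.

```python
def allowed_ratio(p, tau1=1.5):
    """Allowed level of sigma'_j / sigma_j, equation (4.14)."""
    return tau1 - (tau1 - 1.0) * p


def absolute_threshold(p, tau2=0.3, tau3=0.5):
    """Absolute threshold on sigma'_j, equation (4.15)."""
    return tau3 - (tau3 - tau2) * p


def may_merge(components, tau1=1.5, tau2=0.3, tau3=0.5):
    """Merging decision of (4.16) for a candidate set S.

    components is a list of (sigma_j, sigma_merged_j, p_j); sigma_j is None
    when the component's own fit is degenerate.
    """
    for sigma, sigma_merged, p in components:
        limit = absolute_threshold(p, tau2, tau3)
        if sigma is not None:
            limit = max(limit, allowed_ratio(p, tau1) * sigma)
        if not sigma_merged < limit:
            return False
    return True
```

For noise uniformly distributed over one pixel, the standard deviation is 1/sqrt(12), about 0.29 pixels, which matches the value of roughly 0.3 quoted for the lower threshold.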


The  algorithm,  for  finding  sets  of  components  to  be 
merged,  starts  with  detection  of  the  component  with  the 
maximal  sum  of  vector  weights.  Then,  merging  of  this  com¬ 
ponent  with  its  neighbors  is  sequentially  tested,  in  the  order 
of  their  associated  sums  of  weights.  If  one  of  these  merging 
trials  is  successful,  then  merging  of  additional  components 
with  the  already  merged  pair  is  examined.  In  general,  given 
a  set  of  already  merged  components,  neighboring  compo¬ 
nents  are  tested  as  candidates  for  adjoining  this  set.  This 
process  continues  until  all  the  candidates  for  merging  are 
examined.  Then,  the  process  starts  again,  considering  only 
the  components  which  are  not  yet  assigned  to  any  of  the 
already created segments. Eventually, all the components
are  contained  in  one  of  the  segments. 
4.2.3 Third Stage

The  purpose  of  the  third  stage  of  the  segmentation  is 
examination  of  flow  vectors  which  were  assigned  positive 
weights  and  were  not  grouped  into  any  of  the  components 
in the first stage of the segmentation, and thus do not belong
to any of the segments. Such vectors, called O-vectors,
which are neighbors of one of the segments, are tested for
consistency with the Ψ transformation corresponding to
this  segment  and,  if  consistent,  are  merged  into  it.  Then, 
O-vectors,  neighbors  of  the  just  segmented  vectors,  are  ex¬ 
amined  in  their  turn.  This  process  is  iteratively  executed 
until  no  new  vector  is  merged  into  one  of  the  segments. 

It  is  possible  that  after  the  third  stage,  connected  sets 
of  O-vectors,  which  were  not  excluded  from  further  consider¬ 
ation  in  the  first  segmentation  stage,  are  still  not  contained 
in  any  of  the  existing  segments.  In  such  a  case,  the  first  and 
the  second  stages  of  the  segmentation  are  executed  again, 
focused  only  on  these  sets,  thus  possibly  creating  new  seg¬ 
ments. 

5.  FORMING  OBJECT  HYPOTHESES  AND 
RECOVERING  3-D  INFORMATION 

In  the  first  stage  of  the  interpretation  process,  de¬ 
scribed  in  the  preceding  section,  the  flow  field  is  segmented 
into  connected  sets  of  flow  vectors,  where  each  set  is  consis¬ 
tent  with  a  rigid  motion  of  a  roughly  planar  surface.  Such 
a  segment  is  assumed  to  correspond  to  a  portion  of  only 
one  rigid  object.  The  next  task  is  to  detect  sets  of  segments 
which  are  consistent  with  the  same  3-D  motion  parameters. 
Such  a  set  can  be  hypothesized,  employing  the  rigidity  as¬ 
sumption  [ULL79],  to  be  induced  by  one  rigidly  moving 
object (or by the camera motion). In sub-section 5.1 we
describe  an  algorithm  for  computing  the  motion  parame¬ 
ters  from  a  set  of  flow  vectors  generated  by  a  rigid  motion. 
In  section  5.2  we  combine  this  algorithm  with  the  segmen¬ 
tation  results  to  form  object  hypotheses  and  estimate  the 
corresponding  3-D  motion  and  structure.  Again,  for  the 
sake  of  brevity,  many  details  are  suppressed. 

5.1  Estimating  Motion  Parameters  and 
Depth  Information  of  a  Rigid  Object 

5.1.1 Optimization Constraint

Given  a  set  of  flow  vectors,  assumed  to  be  induced  by 
a  rigidly  moving  object,  we  want  to  find  the  3-D  motion 
parameters  which  are  maximally  consistent  with  this  data. 
Following  [BRU81],  we  employ  the  least-squares  approach 
because  of  its  relative  robustness  in  the  presence  of  noise. 
Based  on  (3.2),  the  error  function  to  be  minimized  is 

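$$\sum_{i=1}^{n} W_i \left[ \left( \alpha_i + \Omega_X x_i y_i - \Omega_Y (1 + x_i^2) + \Omega_Z y_i - \frac{T_X - T_Z x_i}{Z_i} \right)^2 + \left( \beta_i + \Omega_X (1 + y_i^2) - \Omega_Y x_i y_i - \Omega_Z x_i - \frac{T_Y - T_Z y_i}{Z_i} \right)^2 \right], \tag{5.1}$$

where $T = (T_X, T_Y, T_Z)$ and $\Omega = (\Omega_X, \Omega_Y, \Omega_Z)$ are the
translation and rotation vectors, respectively, and, for each
$i$ between 1 and $n$, $(\alpha_i, \beta_i)$ is the flow vector computed
at the pixel $(x_i, y_i)$, $W_i$ is its weight and $Z_i$ is the spatial
depth of the corresponding point in the environment.
The task is to determine $T$, $\Omega$ and $\{Z_i\}$ which minimize
this function. Using the decomposition of the flow field
into its rotational and translational components, denoted
by $(\alpha_R, \beta_R)$ and $(\alpha_T, \beta_T)$ (see equations (3.4)), the error
function can be more concisely represented by

$$\sum_{i=1}^{n} W_i \left[ (\alpha_i - \alpha_{Ri} - \alpha_{Ti})^2 + (\beta_i - \beta_{Ri} - \beta_{Ti})^2 \right]. \tag{5.2}$$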


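As can be easily seen, it is actually impossible to determine
the absolute values of $(T_X, T_Y, T_Z)$ and $\{Z_i : i = 1, \ldots, n\}$,
since scaling the translation and all the depth values by a
common factor leaves the error function unchanged.
However, if the length, denoted by $r$, of the
translation vector is non-zero, then it is possible to estimate
the direction of the 3-D translation, represented by
the unit vector

$$(U_X, U_Y, U_Z) = (T_X, T_Y, T_Z)/r, \tag{5.3}$$

and the relative depth values, represented by

$$\tilde{Z}_i = r/Z_i, \quad i = 1, \ldots, n. \tag{5.4}$$

Introducing the abbreviations

$$\alpha_U = U_X - U_Z x = \alpha_T/\tilde{Z} \tag{5.5a}$$

and

$$\beta_U = U_Y - U_Z y = \beta_T/\tilde{Z}, \tag{5.5b}$$

(5.2) can be rewritten as

$$\sum_{i=1}^{n} W_i \left[ (\alpha_i - \alpha_{Ri} - \alpha_{Ui} \tilde{Z}_i)^2 + (\beta_i - \beta_{Ri} - \beta_{Ui} \tilde{Z}_i)^2 \right]. \tag{5.6}$$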

Thus, the task can be reformulated as finding the values
of $(\Omega_X, \Omega_Y, \Omega_Z)$, $(U_X, U_Y, U_Z)$ and $\{\tilde{Z}_i : i = 1, \ldots, n\}$
which minimize this expression. In addition, the depth constraints

$$\tilde{Z}_i \ge 0, \quad i = 1, \ldots, n, \tag{5.7}$$

should be satisfied. Note that this error measure is different
from the one employed in [BRU81] where the contribution
of each flow vector is multiplied by $\alpha_U^2 + \beta_U^2$.

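For any given $i$, $1 \le i \le n$, we can find the optimal
value of $\tilde{Z}_i$, as a function of the motion parameters, by
examining the first derivative of (5.6) with respect to $\tilde{Z}_i$.
This derivative is given by

$$2 W_i \left[ -(\alpha_i - \alpha_{Ri})\,\alpha_{Ui} - (\beta_i - \beta_{Ri})\,\beta_{Ui} + (\alpha_{Ui}^2 + \beta_{Ui}^2)\,\tilde{Z}_i \right]. \tag{5.8}$$

Setting it equal to 0 yields

$$\tilde{Z}_i = \left( (\alpha_i - \alpha_{Ri})\,\alpha_{Ui} + (\beta_i - \beta_{Ri})\,\beta_{Ui} \right) / (\alpha_{Ui}^2 + \beta_{Ui}^2), \tag{5.9}$$

unless $\alpha_{Ui}^2 + \beta_{Ui}^2 = 0$, in which case $\tilde{Z}_i$ can be assigned
any non-negative value. If the expression in (5.9) is negative,
then the corresponding depth constraint in (5.7) is
unsatisfied. In such a case, to minimize the error function
(5.6), $\tilde{Z}_i$ should be set to 0, since the derivative (5.8) is
non-negative for non-negative values of $\tilde{Z}_i$ and, therefore,
the error function is monotonically non-decreasing for these
values. To summarize, the optimal value of $\tilde{Z}_i$ is given by

$$\tilde{Z}_i = \left[ \delta_i / (\alpha_{Ui}^2 + \beta_{Ui}^2) \right]^+, \tag{5.10}$$

where $\delta_i = (\alpha_i - \alpha_{Ri})\,\alpha_{Ui} + (\beta_i - \beta_{Ri})\,\beta_{Ui}$ and $[u]^+$ denotes
$\max(u, 0)$. Substituting (5.10), for any $1 \le i \le n$, into (5.6) and expanding the
resulting expression yields the following representation of
the error, as a function of the motion parameters:

$$E(U, \Omega) = \sum_{i=1}^{n} W_i E_i, \tag{5.11a}$$

where

$$E_i = \begin{cases} \dfrac{\left[ (\alpha_i - \alpha_{Ri})\,\beta_{Ui} - (\beta_i - \beta_{Ri})\,\alpha_{Ui} \right]^2}{\alpha_{Ui}^2 + \beta_{Ui}^2} & \text{if } \delta_i > 0, \\[2ex] (\alpha_i - \alpha_{Ri})^2 + (\beta_i - \beta_{Ri})^2 & \text{otherwise.} \end{cases} \tag{5.11b}$$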
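The per-vector computation behind (5.9) to (5.11) can be sketched as follows. The function name and argument layout are assumptions made for illustration; the inputs are the measured flow, its rotational prediction, and the direction terms of (5.5) at one pixel.

```python
def depth_and_residual(alpha, beta, alpha_r, beta_r, alpha_u, beta_u):
    """Optimal relative depth (5.10) and residual E_i (5.11b) for one vector."""
    a = alpha - alpha_r          # flow left after removing the rotation
    b = beta - beta_r
    norm = alpha_u ** 2 + beta_u ** 2
    delta = a * alpha_u + b * beta_u
    if norm == 0 or delta <= 0:
        # Depth constraint active (or depth undetermined): relative depth 0.
        return 0.0, a * a + b * b
    z_rel = delta / norm
    return z_rel, (a * beta_u - b * alpha_u) ** 2 / norm
```

The residual in the first case is the squared component of the derotated flow perpendicular to the direction `(alpha_u, beta_u)`; it equals the corresponding term of (5.6) evaluated at the returned depth.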

A normalized version of this error function, defined by

$$\sigma(U, \Omega) = \sqrt{ E(U, \Omega) \Big/ \sum_{i=1}^{n} W_i }, \tag{5.12}$$

will also be utilized. $\sigma$ is an estimate of the standard
deviation of the measured flow values from those predicted by
the motion parameters and the corresponding depth values.

Note that the expression (5.11) for the error function
was obtained by assuming non-zero translation. In the case
of a purely rotational motion, the appropriate error function
to be minimized is:

$$E_R(\Omega) = \sum_{i=1}^{n} W_i \left( (\alpha_i - \alpha_{Ri})^2 + (\beta_i - \beta_{Ri})^2 \right). \tag{5.13}$$

If the minimal value of this function is close to the minimal
value of the function (5.11), then the motion is, possibly,
purely rotational.

The task of finding the rotation parameters which minimize
the function $E_R(\Omega)$ can be easily accomplished by
solving a set of three linear equations [BRU81] and will not
be considered further in this paper. Thus, in the next section,
we concentrate on the much more difficult task of finding
values of $U$ and $\Omega$ which minimize the error function
$E(U, \Omega)$ (or, equivalently, the function $\sigma(U, \Omega)$), where $U$
can be any unit vector and $\Omega$ is unconstrained.

5.1.2 Algorithm

The algorithm for recovering the motion parameters
employs an error measure, derived from (5.12), corresponding
to possible directions of the translation vector. A minimum
value of this function is determined, using a multi-resolution
sampling scheme.

Let us start the derivation of this error measure with
the observation that if the depth constraints (5.7) are ignored,
then, for any hypothesized direction of translation,
the optimal rotation parameters can be easily extracted by
solving a set of three linear equations. To see that, notice
that the error function (5.11) can be reduced in this case to
the function

$$E'(U, \Omega) = \sum_{i=1}^{n} W_i \, \frac{\left[ (\alpha_i - \alpha_{Ri})\,\beta_{Ui} - (\beta_i - \beta_{Ri})\,\alpha_{Ui} \right]^2}{\alpha_{Ui}^2 + \beta_{Ui}^2}. \tag{5.14}$$

Differentiating $E'(U, \Omega)$ with respect to the rotation
parameters and setting the derivatives equal to 0 yields three
linear equations with the rotation parameters as unknowns.
Thus, ignoring the depth constraints (5.7), the search space
can be limited to the unit sphere $\{ U : |U| = 1 \}$.

Moreover, changing the sign of any unit vector $U$ has
no effect on the value of $E'(U, \Omega)$ since it only affects the
sign of $\alpha_{Ui}$ and $\beta_{Ui}$. Therefore, the search space can be
further restricted to the hemisphere

$$HS = \{ U : |U| = 1 \text{ and } U_Z \ge 0 \}. \tag{5.15}$$

The preferred sign of $U$ can then be determined, as proposed
in [BRU81], as the one which gives $Z_i > 0$ for most
indices $i$. Still, we wish to incorporate these constraints or,
equivalently, the equations (5.11b) in a more rigorous way.
Hence, for each $U$ in HS, we define the error measure

$$\sigma_1(U) = \min_{B,\,\Omega} \sigma(BU, \Omega), \tag{5.16}$$

where $B$ can have the values +1 or -1. The goal is to
find a vector $U$ in HS which minimizes the function $\sigma_1$.
The associated values of $B$ and $\Omega$ are, respectively, the
determined sign of the translation vector and the estimated
rotation parameters. The function $\sigma_1$ is, however, difficult
to compute. Therefore, in the proposed algorithm we compute
an approximation to $\sigma_1$ which is experimentally shown
to be very accurate. A few main steps can be distinguished
in the procedure for computing this approximation:

1) Given a vector $U$ in HS, estimate the optimal rotation
vector $\Omega^*$ by minimizing $E'(U, \Omega)$ with respect to
$\Omega$, and compute the corresponding normalized error measure
$\sigma'(U, \Omega^*)$. This error value is a lower bound of $\sigma_1(U)$
since it minimizes the error function $\sigma(U, \Omega)$, with respect
to $\Omega$ and the sign of $U$, without considering the depth
constraints (5.7).

2) Compute $\sigma(U, \Omega^*)$ and $\sigma(-U, \Omega^*)$. Determining
the minimum of these two error values yields the preferred
sign, denoted by $\mu$, of $U$. Using the notation $U^* = \mu U$,
$\sigma(U^*, \Omega^*)$ is an upper bound of $\sigma_1(U)$, because it gives
the actual error measure for some values of $B$ and $\Omega$ in
equation (5.16).

3) Compute an approximation to $\sigma_1(U)$ by averaging
its lower and upper bounds:

$$\hat\sigma_1(U) = \left( \sigma'(U, \Omega^*) + \sigma(U^*, \Omega^*) \right) / 2. \tag{5.17}$$

The relative deviation of $\hat\sigma_1(U)$ from $\sigma_1(U)$ is bounded by

$$\left( \sigma(U^*, \Omega^*) - \sigma'(U, \Omega^*) \right) / \left( 2\,\hat\sigma_1(U) \right). \tag{5.18}$$

In the experiments, this value was found to be very small,
typically much less than 0.01.
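
The three steps above can be sketched as a short routine. This is a minimal illustration, not the implementation used in the experiments, and it rests on several assumptions: `trans` holds the translational flow directions $(\alpha_{Ui}, \beta_{Ui})$ for the hypothesized $U$ (all non-zero), `rot_basis` is a hypothetical array of 2 x 3 matrices mapping the rotation parameters linearly to the rotational flow $(\alpha_{Ri}, \beta_{Ri})$ at each point, and the linear system of step 1 is solved by a generic least-squares call rather than by writing out the three normal equations.

```python
import numpy as np

def sigma(flow, trans, rot_flow, w, depth_constrained=True):
    """Normalized error (5.12); with depth_constrained=False it is based on E' of (5.14)."""
    res = flow - rot_flow                       # (alpha_i - alpha_Ri, beta_i - beta_Ri)
    denom = np.sum(trans**2, axis=1)            # alpha_Ui^2 + beta_Ui^2, assumed > 0
    z = np.sum(res * trans, axis=1) / denom     # unconstrained optimum (5.9)
    if depth_constrained:
        z = np.maximum(z, 0.0)                  # clamped optimum (5.10)
    err = np.sum((res - trans * z[:, None])**2, axis=1)
    return np.sqrt(np.sum(w * err) / np.sum(w))

def sigma1_hat(flow, trans, rot_basis, w):
    """Approximate sigma_1(U) by averaging its lower and upper bounds (5.17)."""
    flow = np.asarray(flow, float)
    trans = np.asarray(trans, float)
    rot_basis = np.asarray(rot_basis, float)
    w = np.asarray(w, float)

    # Step 1: minimize E' over the rotation parameters. E' is the weighted
    # squared flow residual projected on the unit normal of the
    # translational flow direction, which is linear in the rotation.
    normal = np.stack([trans[:, 1], -trans[:, 0]], axis=1)
    normal = normal / np.linalg.norm(normal, axis=1, keepdims=True)
    sw = np.sqrt(w)
    A = sw[:, None] * np.einsum('nj,njk->nk', normal, rot_basis)
    y = sw * np.sum(normal * flow, axis=1)
    omega, *_ = np.linalg.lstsq(A, y, rcond=None)
    rot_flow = rot_basis @ omega
    lower = sigma(flow, trans, rot_flow, w, depth_constrained=False)

    # Step 2: evaluate the depth-constrained error for both signs of U.
    # Changing the sign of U changes the sign of the translational flow.
    candidates = {s: sigma(flow, s * trans, rot_flow, w) for s in (+1, -1)}
    mu = min(candidates, key=candidates.get)
    upper = candidates[mu]

    # Step 3: average the two bounds.
    return 0.5 * (lower + upper), mu, omega, (lower, upper)
```

The returned pair of bounds allows the ratio (5.18) to be evaluated by the caller whenever the estimate is non-zero.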

The search for an optimal vector in HS consists of
a sampling (similar to [LAW82]) of the error measure $\hat\sigma_1$.
A multi-resolution scheme is employed, where in the first
iteration the set HS is coarsely sampled and, in each additional
iteration, only the neighborhood of the vector giving
a minimum value in the previous iteration is sampled, using
a finer resolution. Note that solutions near the boundary of
HS require a vector $U'$ to be defined as a 'neighbor' of a
vector $U$ if either $U'$ or $-U'$ is close to $U$. Another way
to obtain the same effect while using the normal definition
of a neighborhood is to extend the domain of definition of
the function $\hat\sigma_1$ to the whole unit sphere, employing exactly
the same definition used for the domain HS. In this case,
$\hat\sigma_1(-U) = \hat\sigma_1(U)$ for each unit vector $U$; thus,
computationally, it makes no difference which domain of definition
is used.
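
A coarse-to-fine search of this kind might look as follows. The sketch makes its own assumptions: the hemisphere is sampled on a regular grid in spherical angles, each refinement samples a window of one previous grid step around the current best sample, and the error measure is passed in as an arbitrary sign-invariant function `f`. The grid layout, step sizes and names are illustrative and are not taken from [LAW82] or from the experiments. Neighborhoods that cross the boundary of HS are handled in the second way described above, by simply evaluating `f` on the whole sphere.

```python
import numpy as np

def unit_vector(phi, theta):
    """Unit vector for polar angle phi (from the Z axis) and azimuth theta."""
    return np.array([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)])

def coarse_to_fine_min(f, iterations=4, coarse_step_deg=10.0, refine=5):
    """Minimize a sign-invariant function f(U) over unit vectors U."""
    step = np.radians(coarse_step_deg)
    # First iteration: coarse sampling of the whole hemisphere.
    phis = np.arange(0.0, np.pi / 2 + 1e-9, step)
    thetas = np.arange(0.0, 2 * np.pi - 1e-9, step)
    best = None
    for _ in range(iterations):
        for phi in phis:
            for theta in thetas:
                value = f(unit_vector(phi, theta))
                if best is None or value < best[0]:
                    best = (value, phi, theta)
        # Later iterations: sample only the neighborhood of the current
        # minimum at a finer resolution. Angles may leave the hemisphere;
        # since f(-U) = f(U) this is harmless.
        _, phi0, theta0 = best
        offsets = np.linspace(-step, step, 2 * refine + 1)
        phis, thetas = phi0 + offsets, theta0 + offsets
        step /= refine
    u = unit_vector(best[1], best[2])
    return (u if u[2] >= 0 else -u), best[0]

# Example with a synthetic, sign-invariant error surface whose minimum
# lies along the direction u0.
u0 = np.array([0.3, 0.4, 0.866])
u0 = u0 / np.linalg.norm(u0)
u_best, f_best = coarse_to_fine_min(lambda u: 1.0 - np.dot(u, u0) ** 2)
```

A regular grid in the two angles oversamples the region near the pole; a sampling that is more uniform on the sphere would be a natural refinement of this sketch.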

The final solution of $U$, and the corresponding sign
$\mu$ and the rotation parameters $\Omega^*$, defined in the procedure
for computing $\hat\sigma_1$, are the determined motion parameters.
Substituting these parameters in equation (5.10),
the relative depth, corresponding to each flow vector, can
be estimated as well.

We should mention that sometimes the error function
$\hat\sigma_1$ is very close to its minimal value in a large portion of
the search space (see figure 6.2e). Hence, in the presence of
noise, it may be impossible to obtain reasonably accurate
estimates of the motion parameters. Two complementary
approaches may be taken in order to deal with this ambiguity.
First, constraints on the motion parameters and
the environmental depth, rather than values, can still be
recovered, using, for example, the coefficients of the related
$\Psi$ transformations (see equations (4.4)). Second, possible
values of the motion parameters can be represented by a
probabilistic distribution function. Such a function can be
defined, for example, on the set HS, using the computed
values of $\hat\sigma_1$. Investigation of situations which may lead to
this ambiguity is under way.

5.2 Forming Object Hypotheses

Segments of the flow field which are consistent with
the same motion parameters can be hypothesized, using
the rigidity assumption [ULL79], to be induced by one rigidly
moving object (or by the camera motion). The process for
detecting such sets of segments is similar to the second stage
of the segmentation process, where components are merged
into segments. Optimal motion parameters and a related
error measure $M_i$ are computed for each segment $SEG_i$,
using the algorithm described in the previous section. In
addition, given any set of segments, the algorithm is applied
to this set and the corresponding motion parameters
are computed. Then, for each segment $SEG_i$ in the set, an
error measure $M_i'$ is obtained by substituting these parameters
and the related flow data in equation (5.17). Based
on the error values $\{M_i\}$ and $\{M_i'\}$, consistency of the
set with rigid motion is determined, employing a decision
procedure similar to the one described in section 4.2.2.

Actually, each segment is sampled, using the method in
step (2) of the multipass Hough technique (section 4.2.1.2),
and only the selected vectors are used for forming object
hypotheses and computing the corresponding motion parameters.
This sampling procedure considerably reduces
the computation time. Notice that, because each segment is
sampled, all the distinct surfaces and independently moving
objects, even the small ones, are appropriately represented,
thus preventing the suppression of valuable data.

In addition to the ambiguity described in the previous
section, another ambiguity may exist in the decomposition
of the environment into independently moving objects. For
example, two independently moving objects induce, in some
cases, a flow field which can be interpreted as resulting from
one rigidly moving object. In order to deal with this ambiguity,
one may have to find a set of possible decompositions,
not only one. Analysis of this ambiguity is also under way.

6.  EXPERIMENTS 

In this section we present four experiments which
demonstrate our proposed scheme for the interpretation of
optical flow fields. The first two experiments are based on
simulated data, and the last two are based on real data.
In all the experiments, values appearing in translation vectors
and surface equations are given in focal units, whereas
rotation parameters are given in radians and flow vectors
are given in pixel units. The flow values in the
experiments based on simulated data are rounded to integers,
thus inducing noise uniformly distributed between
-1/2 and +1/2 pixels. The methods employed for computing
the real data in experiments 3 and 4 also produce
flow values given in integer units, hence the noise level
in these experiments should be at least as high as in experiments
1 and 2 (actually it is higher). The image, in all the
experiments, contains 128 x 128 pixels. The field of view of
the camera is 45° in the experiments with simulated data
and 30° in the experiments with real data.
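
To relate the pixel units of the flow to the focal units of the scene, a rough conversion can be sketched as below. It assumes, which the text does not state explicitly, that the quoted field of view spans the full 128-pixel image width and that one focal unit equals the focal length; the function names are illustrative. The second function gives the standard deviation of a rounding error uniformly distributed over one pixel.

```python
import math

def focal_length_pixels(fov_deg, image_size_pixels):
    """Focal length in pixels for a pinhole camera whose field of view
    (assumed to span the full image width) is fov_deg degrees."""
    return (image_size_pixels / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

def rounding_noise_std():
    """Standard deviation, in pixels, of noise uniform on [-1/2, +1/2]."""
    return 1.0 / math.sqrt(12.0)

f_sim = focal_length_pixels(45.0, 128)    # simulated data, about 154.5 pixels
f_real = focal_length_pixels(30.0, 128)   # real data, about 238.9 pixels
noise_focal_units = rounding_noise_std() / f_sim
```

Under these assumptions the rounding noise in the simulated experiments amounts to roughly 0.29 pixels, or about 0.002 focal units.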

6.1 Experiment 1

The first experiment simulates a translatory motion of
the camera, represented by the vectors $T_C = (0., 0.02, 1.)$
and $\Omega_C = (0., 0., 0.)$. The environment consists of two distinct
surfaces: a plane described by the equation $Z = 50Y + 100$
and an ellipsoid represented by
$(X - 2)^2 + [(Y - 2)/4]^2 + (Z - 5)^2 = 1$.
A flow vector is computed for each pixel, unless
the corresponding ray of light does not intersect any of
the surfaces, in which case the related weight is assumed to
be 0 (otherwise it is 1). A sample of the flow field is shown
in figure 6.1a.

The results of the three stages of the segmentation,
shown in figures 6.1b, 6.1c and 6.1d, demonstrate the role
and importance of each of these stages. The two segments
found in this process were determined to be consistent with
the same rigid motion. The error function $\hat\sigma_1$ (equation
5.17) was computed using 64 vectors from each segment.
Employing a spherical coordinate system $(r, \phi, \theta)$, where

$$X = r \sin(\phi) \cos(\theta), \tag{6.1a}$$

$$Y = r \sin(\phi) \sin(\theta) \tag{6.1b}$$

and

$$Z = r \cos(\phi), \tag{6.1c}$$

the domain of definition of $\hat\sigma_1$, that is, the hemisphere
$\{ U : |U| = 1,\ U_Z \ge 0 \}$, can be represented by the set

$$\{ (\phi, \theta) : 0 \le \phi \le 90^\circ,\ 0^\circ \le \theta < 360^\circ \}. \tag{6.2}$$

This representation is utilized for displaying the function
$\hat\sigma_1$ in figure 6.1e, where $(\phi, \theta)$ are used as polar coordinates.
Employing the sampling procedure for minimizing
$\hat\sigma_1$, the motion parameters were determined, after two
iterations, to be $U = (0.0017, -0.0204, -0.9998)$ and
$\Omega = (-0.0004, -0.0003, -0.0004)$. Note that, assuming a stationary
environment, the camera motion is given by $-U$
and $-\Omega$. These results are in good agreement with the
correct values. Substituting the computed values in equation
(5.10), the 'reciprocal depth' map, that is, the function
$r/Z$ shown in figure 6.1f, was obtained.

6.2 Experiment 2

In the second experiment, the camera motion is composed
of both translation and rotation, described by
$T_C = (0.5, 0.5, 1.)$ and $\Omega_C = (0.02, -0.02, 0.05)$. The environment
contains an independently moving sphere, defined by
$(X - 9)^2 + (Y - 9)^2 + (Z - 30)^2 = 4$. An object coordinate
system is defined, which is parallel to the camera coordinate
system but has its origin in the sphere center $(9, 9, 30)$.
The motion of the object, in this coordinate system, is
represented by $T_O = (0.5, -0.5, 0.)$ and $\Omega_O = (0., 0., -0.2)$.
The stationary environment is composed of two surfaces: a
plane described by $Z = X + 0.5Y + 50$ and an ellipsoid described
by $[(X + 3)/2]^2 + [(Y + 1)/5]^2 + [(Z - 20)/2]^2 = 1$.
A 32 x 32 sample of the flow field corresponding to this
scene is shown in figure 6.2a.

The segments found in the experiment are shown in
figure 6.2b. The two segments associated with the stationary
environment were determined to be consistent with
the same rigid motion, while no rigid motion compatible
with the third segment was found to be consistent
with either of the other segments. Thus, the decomposition
of the flow field into sets corresponding to independently
moving objects could be uniquely (and correctly)
determined. The error function $\hat\sigma_1$ corresponding to the
stationary environment is displayed in figure 6.2c. The associated
motion parameters of the camera were determined
to be $-U = (0.3897, 0.4017, 0.8287)$ (the corresponding actual
values were $U_C = (0.4082, 0.4082, 0.8164)$) and
$-\Omega = (0.0204, -0.0196, 0.0494)$. The related depth map is
represented by the function $r/Z$ in figure 6.2d.
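
As a rough check on the reported translation directions, the angle between each recovered direction and the simulated one can be computed directly from the numbers quoted in the text. This short calculation is an added illustration; the helper name is hypothetical and the angular error is not a measure used in the paper.

```python
import numpy as np

def angle_deg(u, v):
    """Angle in degrees between two (not necessarily unit) vectors."""
    u = np.asarray(u, float)
    v = np.asarray(v, float)
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

# Experiment 1: recovered camera translation -U versus the simulated one.
exp1 = angle_deg([-0.0017, 0.0204, 0.9998], [0.0, 0.02, 1.0])
# Experiment 2: recovered camera translation -U versus the simulated one.
exp2 = angle_deg([0.3897, 0.4017, 0.8287], [0.5, 0.5, 1.0])
```

With the quoted values, the direction error is about a tenth of a degree in experiment 1 and somewhat over one degree in experiment 2.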

The error function corresponding to the independently
moving object is shown in figure 6.2e. This function is
very close to its minimal value in a large portion of the
search space, thus demonstrating the ambiguity discussed
in section 5.1.2.


Figure 6.1: Experiment 1. (a) A sample of the
flow field.


Figure 6.1, continued: (b) Components determined in the
first step of the segmentation. Each component is represented
by a specific pattern. The small areas with the densest
pattern correspond to vectors which are not contained in
any of the components. The irregular shapes of the components
were caused by the round-off error. (c) Segments obtained
by merging components consistent with the same $\Psi$
transformation. (d) Final segmentation. Note that almost
all the flow vectors which, after the first stage, had not been
included in any of the components, were merged into the
segments. (e) The error function $\hat\sigma_1$, shown upside-down,
defined on the hemisphere $\{ U : |U| = 1,\ U_Z \ge 0 \}$. The
spherical coordinates $(\phi, \theta)$, employed in equation (6.2) for
representing this hemisphere, are used here as polar coordinates.
(f) The function $r/Z$, where $r$ is the length of the
translation vector and $Z$ is the environmental depth. The
length of each bar represents the relative value of $r/Z$ at
the image pixel corresponding to the attached dot.

Figure 6.2: Experiment 2. (a) A 32 x 32 sample of the
flow field. (b) Final segmentation. (c) The error function
$\hat\sigma_1$, shown upside-down, corresponding to the stationary
environment. (d) The depth function $r/Z$ corresponding
to the stationary environment. The round-off error has
a strong effect, especially in the upper right corner, near
the focus of expansion. (e) The error function $\hat\sigma_1$, shown
upside-down, corresponding to the moving object.


6.3  Experiment  3 

The third experiment, taken from [RIE83], is based on
real data. In this experiment the camera was translated
roughly in the direction of the Z-axis, between two textured
cylinders, towards a textured plane parallel to the
image plane, and then rotated about its Y axis a few degrees.
Figure 6.3a shows the flow vectors determined for a
set of interesting points extracted from the image; for more
details on how the flow field was extracted see [RIE83]. The
weight assigned to each vector is 1, since no reliability measure
was computed.

The results of the segmentation process are shown in
figure 6.3b. The three segments found in this process are
compatible with the same camera motion. Figure 6.3c displays
the corresponding error function $\hat\sigma_1$. Assuming a stationary
environment, the recovered motion parameters of
the camera, $-U = (-0.0079, 0.0181, 0.9998)$ and
$-\Omega = (-0.0018, 0.0203, -0.0006)$, are consistent with the
specifications of the experiment. The estimated values of $r/Z$,
shown in figure 6.3d, have a large variation in the central
part of the image, whereas the actual values are approximately
constant in this area. These errors are caused by
the presence of noise in the flow values near the focus of
expansion and are unavoidable in such circumstances. Note
that this experiment demonstrates the ability of our scheme
to interpret sparse flow fields.


Figure 6.3: Experiment 3. (a) The flow field produced in [RIE83]. (b) Final segmentation.
Each segment is represented by a distinct shape, and the black dots correspond to flow vectors
which are not contained in any of the segments. (c) The error function $\hat\sigma_1$, shown upside-down.
(d) The estimated depth function $r/Z$.

6.4 Experiment 4

Figures 6.4a and 6.4b are images taken from a camera
translated in the direction of its X-axis. The scene
mainly contains a coffee can in the front and a plant in the
background. The flow field, shown in figure 6.4c, was computed
using a modified version of the algorithm proposed in
[GLA83]. The new version, as well as a reliability measure
assigned to each flow vector, was developed by Anandan
[ANA84]. Based on this reliability measure, a weight plane,
shown in figure 6.4d, was computed.

The four segments in figure 6.4e were determined to be
compatible with the same motion parameters. The corresponding
error function is shown in figure 6.4f. The optimal
motion parameters of the camera, obtained by minimizing
this function, are $-U = (1., 0., 0.)$ and
$-\Omega = (0.0000, 0.0019, 0.0001)$. These results are very close to the
correct ones. Figure 6.4g shows the corresponding 'reciprocal
depth' map, namely, $r/Z$.

7.  SUMMARY 

We have presented a new approach for the interpretation
of optical flow fields which are induced by motion of
the camera as well as motion of several rigid objects in the
environment. The interpretation goals of decomposing the
flow field into sets corresponding to independently moving
objects, recovery of motion parameters, and estimation of
relative depth of environmental surfaces were shown to be
feasible. An algorithm based on our approach was demonstrated
to work with sparse, noisy and partially incorrect
data, derived from both artificial and real images.

A hierarchical structure, based on four levels of organization
in the flow field, has been employed. In the interpretation
process, units from each level are combined into
larger units in the next level based on their consistency with
appropriate parameter values. Thus, flow vectors consistent
with an affine transformation are combined into one
component; then, components that are compatible with the
same $\Psi$ transformation are merged into a segment; and, finally,
sets of segments which satisfy the same 3-D motion
parameters are hypothesized to correspond to one rigid object.
The techniques for computing the parameter values in
each level have been based, whenever possible, on solving linear
equations derived from the least-squares criterion. Otherwise,
sampling techniques, combined with multi-resolution
search schemes, have been employed. Combining all these
techniques, an effective and efficient algorithm has
been developed.

In some situations, however, there exists an inherent
ambiguity in the interpretation of noisy flow fields, as was
briefly discussed and demonstrated in sections 5 and 6. In
our future work we will characterize such situations and propose
appropriate modifications of the interpretation goals,
based on the ideas already mentioned in section 5. These
modifications will be basically aimed at issues of representation
of information which can be extracted even in
ambiguous cases. Integration of such information over a
time sequence of flow fields may, eventually, resolve the
ambiguity and result in a unique interpretation.

Figure 6.4: Experiment 4. (a) The first intensity
image. (b) The second intensity image. (c) A 32 x 32
sample of the computed flow field. (d) The weight plane.
High values are represented by bright gray levels. (e)
Final segmentation. The white areas correspond
to flow vectors assigned weight 0. The areas with
the densest pattern correspond to unsegmented vectors.
(f) The error function $\hat\sigma_1$, shown upside-down.
Note the two peaks which actually correspond to
the same translation, because $\hat\sigma_1$ is invariant to sign
change in the translation vector. (g) The estimated
depth function $r/Z$.

ABSTRACT

A new concept in passive ranging to moving objects
is described which is based on the comparison of multiple
image flows. It is well known that if a static scene is
viewed by an observer undergoing a known relative translation
through space, then the distance to objects in the
scene can be easily obtained from the measured image
velocities associated with features on the objects (i.e.
motion stereo). But in general, individual objects are
translating and rotating at unknown rates with respect to
a moving observer whose own motion may not be accurately
monitored. The net effect is a complicated image
flow field in which absolute range information is lost.
However, if a second image flow field is produced by a
camera whose motion through space differs from that of
the first camera by a known amount, the range information
can be recovered by subtracting the first image flow
from the second. This "difference flow" must then be
corrected for the known relative rotation between the two
cameras, resulting in a divergent relative flow from a
known focus of expansion. This passive ranging process
may be termed Dynamic Stereo, the known difference in
camera motions playing the role of the stereo baseline.
We present the basic theory of this ranging process, along
with some examples for simulated scenes. Potential
applications are in autonomous vehicle navigation (with
one fixed and one movable camera mounted on the vehicle),
coordinated motions between two vehicles (each carrying
one fixed camera) for passive ranging to moving
targets, and in industrial robotics (with two cameras
mounted on different parts of a robot arm) for intercepting
moving workpieces.

1.  INTRODUCTION 

Passive ranging by triangulation methods, employed
so successfully by humans, has received much attention
in the machine vision literature in recent years [1]. It is
obvious that the ability to recover absolute range to
objects in a scene would be important in a variety of
robotic applications, and passive techniques to do so are
certainly of interest. To date, only two basic methods of
passive ranging have been discussed, "static stereo"
(employing two cameras separated by a known baseline)
and "motion stereo" (utilizing a single camera moving in
a known way through a stationary scene). In this paper
we introduce a new concept in passive ranging to moving
objects, termed "dynamic stereo," which is based on the
comparison of multiple image flows.

By far, most of the literature on passive ranging has
been concerned with the difficult "correspondence problem"
associated with the assignment of stereo disparities
(see the many references cited in [1]). In addition to the
traditional method of intensity correlation between
images, much attention has been paid to the theory of
Marr and Poggio [2], with its implementation by Grimson
[3] and recent insights of Nishihara [4], as well as the
theory of Mayhew and Frisby [5]. The use of more than
two camera locations, to aid in solving the correspondence
between images, has been approached in different
ways by Moravec [6] and Tsai [7]. Nevertheless, solution
of this correspondence problem remains a computationally
expensive and slow process. Moreover, a maximum
ranging distance is implied by the finite resolution of the
cameras and the statically configured baseline between
cameras.

In principle, the difficulties encountered with "static
stereo" can be overcome using "motion stereo." Here, a
single camera is moved through space in a known way,
while imaging a stationary scene. The result is that, over
a period of time, the camera traverses a known physical
baseline of arbitrary length. In addition, correspondence
is established by tracking features over the small interframe
distances to build up effective stereo disparities,
which then yield range values [8, 9]. In practice, problems
can arise from inaccuracies in the camera motion
parameters. However, a more fundamental limitation is
the necessity of a stationary scene. If, while the camera
is moving from one end of the baseline to the other,
objects in the scene are also moving but with unknown
velocities, then the relative motions between camera and
objects would not be known and hence, the physical baseline
would be undetermined. In such cases, absolute
range information is lost; relative range (such as
object surface slopes) and scaled motion parameters can
still be recovered, though not always uniquely [10, 11].
Actually, once unknown object motions are admitted, we are
really in the realm of the more general "structure from
motion" problem which also has a long history, but has
recently been addressed in the context of "image flow"
theory [10 - 13].

Dynamic Stereo can be viewed as an extension of
motion stereo, applicable to scenes containing moving
objects. It employs two cameras in known relative
motion, both imaging a scene containing independently
moving objects (which are assumed rigid). The relative
rigid body motions between objects and cameras generate
image flows (of feature points and contours) at each camera.
Differences between the two flow fields are mainly
due to the known relative motion between the two cameras.
This fact will be exploited below in order to recover
absolute range to the objects in an evolving scene.
Dynamic stereo can then be used in conjunction with the
image flow derived from a single camera in order to
recover surface shape as well as absolute motion parameters
for objects in the scene [11].

We expect that dynamic stereo can be utilized in a
variety of configurations. In the context of autonomous
land vehicles, one can mount on the vehicle one fixed
camera and one sliding camera in order to range to moving
vehicles in the scene. The larger distances typically
associated with flight would require two cameras, each
mounted on separate aircraft moving with respect to each
other at known relative speeds. This kind of coordinated
flight could enable passive ranging to moving targets.
There is also potential use in industrial robotics for handling
moving objects, by configuring two cameras on
different parts of a robot arm such that they experience a
relative motion. It should be noted, however, that any
application will require that at some point in their relative
motion, the two cameras become sufficiently close so
that their respective images can be easily brought into
correspondence.

This paper presents the basic theory of Dynamic
Stereo along with several simulated examples. The concepts
associated with "relative image flows" are described
in Section 2, which follows. Section 3 then addresses the
recovery of range to moving objects using these relative
flows. Filtering techniques to reduce the effects of noise
on the required image velocities are discussed as well.
Concluding remarks are presented in Section 4.

2.  RELATIVE  IMAGE  FLOWS 

The flow fields measured at each camera correspond
to the time-varying projection of object surface texture,
due to the relative rigid body motions between objects in
the scene and the cameras. The equations relating image
velocity to relative space motion and distance to points in
the scene have been derived in other studies (e.g. [10 -
12]); they are given by

$$v_x = \left\{ x \frac{V_Z}{Z} - \frac{V_X}{Z} \right\} + \left[ x y\,\Omega_X - (1 + x^2)\,\Omega_Y + y\,\Omega_Z \right], \tag{1a}$$

$$v_y = \left\{ y \frac{V_Z}{Z} - \frac{V_Y}{Z} \right\} + \left[ (1 + y^2)\,\Omega_X - x y\,\Omega_Y - x\,\Omega_Z \right]. \tag{1b}$$

Fig. 1 - Spatial coordinates moving with the observer,
and image coordinate system.

Figure 1 illustrates the coordinate systems of the observer
$(X, Y, Z)$, to whom the relative translations
$(V_X, V_Y, V_Z)$ and rotations $(\Omega_X, \Omega_Y, \Omega_Z)$ are ascribed
(and which may differ for each rigid object in the scene),
and his image plane $(x, y)$, which has been reinverted and
scaled to a focal length of unity. As the directions to
points (or objects) in the scene are specified by their
image coordinates $(x, y)$, their absolute range is determined
by the component of distance $Z$ along the
observer's line of sight. It is seen from equations (1) that
the distance $Z$ appears in ratio with the translational
motion parameters. Clearly, if the motion parameters
were all known (e.g., a camera moving through a stationary
scene), the distance $Z$ could be obtained directly from
measured image velocities; this is simply “motion stereo.”
But if the objects are also moving, then these relative
motion parameters are unknown, as is the distance $Z$ to
points (or the “scale factor” $Z_0$, slopes and curvatures in the
case of surface patches). The theory of single image flows
addresses this problem of recovering both surface structure
and space motion [10-13]; however, solutions are
obtained in a form which is scaled by the factor $Z_0$.
That is, absolute range is not recoverable from single
image flows.
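
The scale ambiguity noted above can be checked numerically. The following is a minimal sketch of equations (1a, b), assuming a unit focal length as in the text; the function name `image_flow` and the sample numbers are illustrative choices, not part of the original work.

```python
def image_flow(x, y, Z, V, Omega):
    """Image velocity (v_x, v_y) at image point (x, y), per equations (1a, b).

    Z      -- distance to the scene point along the line of sight
    V      -- relative translation (V_X, V_Y, V_Z)
    Omega  -- relative rotation (Omega_X, Omega_Y, Omega_Z)
    The image plane is assumed scaled to unit focal length.
    """
    VX, VY, VZ = V
    OX, OY, OZ = Omega
    vx = (x * VZ - VX) / Z + (x * y * OX - (1 + x * x) * OY + y * OZ)
    vy = (y * VZ - VY) / Z + ((1 + y * y) * OX - x * y * OY - x * OZ)
    return vx, vy


# Scaling V and Z by the same factor leaves the flow unchanged, which is
# why absolute range cannot be recovered from a single image flow.
flow_near = image_flow(0.2, -0.1, 10.0, (1.0, 0.5, 2.0), (0.01, -0.02, 0.03))
flow_far = image_flow(0.2, -0.1, 50.0, (5.0, 2.5, 10.0), (0.01, -0.02, 0.03))
```

Here `flow_near` and `flow_far` agree even though the second point is five times farther away, because only the ratios of translation to range enter the translational term.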

The mathematical basis of dynamic stereo is simple;
one need only note from (1) that image velocities are
linearly proportional to the parameters of space motion.
We can make this more explicit by rewriting (1) symbolically
as

$$ \mathbf{v}(x, y) = \frac{1}{Z(x, y)}\, T(x, y)\, \mathbf{V} + R(x, y)\, \boldsymbol{\Omega}, \tag{2} $$

where the elements of the translation and rotation
matrices, $T$ and $R$, are functions only of the image coordinates,
and may be read directly from equations (1).

In a dynamic stereo configuration, each camera has
its own set of relative motion parameters (for each object
in the scene), and generally, the two image planes are not
coincident. However, we can configure the two cameras
such that they come very close to each other at some
time during their own relative motion. At that instant,
the two image planes are nearly coincident and
correspondence should be easily established. If we refer
the measured image velocities to that moment in time,
then we may treat the matrices $T(x, y)$ and $R(x, y)$ for
the two cameras as identical, as if the two image planes
were indeed coincident momentarily. (More will be said
of the residual effects of a small but finite baseline in Section
3.4 below.) Now if the relative motion between cameras
is given by $\Delta\mathbf{V} = (\Delta V_X, \Delta V_Y, \Delta V_Z)$ and
$\Delta\boldsymbol{\Omega} = (\Delta\Omega_X, \Delta\Omega_Y, \Delta\Omega_Z)$, then the second camera sees an
augmented image flow of $\mathbf{v} + \Delta\mathbf{v}$, where from (2), the
“difference flow” $\Delta\mathbf{v}$ is given by

$$ \Delta\mathbf{v}(x, y) = \frac{1}{Z(x, y)}\, T(x, y)\, \Delta\mathbf{V} + R(x, y)\, \Delta\boldsymbol{\Omega}. \tag{3} $$

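As the distance information is contained in the translational
term in (3), we can bring the known rotational
term to the left-hand side and define a “relative flow” or
“modified difference flow” as

$$ \Delta\mathbf{w} \equiv \Delta\mathbf{v} - R\, \Delta\boldsymbol{\Omega} = \frac{1}{Z}\, T\, \Delta\mathbf{V}. \tag{4} $$

This relative flow takes the form of a divergent flow field
with a known focus of expansion (as long as $\Delta V_Z \neq 0$),
as is seen from writing the individual components of (4),

$$ \Delta w_x = \frac{\Delta V_Z}{Z}\, (x - x_{foe}), \tag{5a} $$

$$ \Delta w_y = \frac{\Delta V_Z}{Z}\, (y - y_{foe}), \tag{5b} $$

where

$$ x_{foe} = \frac{\Delta V_X}{\Delta V_Z} \quad \text{and} \quad y_{foe} = \frac{\Delta V_Y}{\Delta V_Z}. \tag{5c,d} $$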


Figures 2a, b, c illustrate examples of such divergent relative
flows. In each case, a simulated scene is constructed
and feature points are marked. Two sets of space
motion parameters are selected, corresponding to the relative
motion between objects and each of the two cameras
(the difference between these sets of parameters being the
relative motion between cameras). The individual flows
of the feature points are displayed along with the divergent
relative flow. No noise effects were introduced into
this simulation. Figure 2 was created with the Image
Flow Simulator [14].
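
A noise-free experiment of this kind can be sketched in a few lines. This is not the Image Flow Simulator [14]; it is a minimal illustration, with made-up motion parameters and hypothetical function names, of forming the two individual flows from (1), differencing them, and removing the known relative rotation to obtain the relative flow.

```python
def image_flow(x, y, Z, V, Omega):
    """Image velocity at (x, y) for range Z, translation V and rotation Omega."""
    VX, VY, VZ = V
    OX, OY, OZ = Omega
    vx = (x * VZ - VX) / Z + (x * y * OX - (1 + x * x) * OY + y * OZ)
    vy = (y * VZ - VY) / Z + ((1 + y * y) * OX - x * y * OY - x * OZ)
    return vx, vy


def relative_flow(x, y, v1, v2, dOmega):
    """Difference flow v2 - v1, corrected for the known relative rotation."""
    dOX, dOY, dOZ = dOmega
    dwx = (v2[0] - v1[0]) - (x * y * dOX - (1 + x * x) * dOY + y * dOZ)
    dwy = (v2[1] - v1[1]) - ((1 + y * y) * dOX - x * y * dOY - x * dOZ)
    return dwx, dwy


# Illustrative parameters: camera 1 motion, and camera 2 relative to camera 1.
V1, O1 = (1.0, 0.5, 2.0), (0.01, -0.02, 0.03)
dV, dO = (0.1, -0.05, 0.2), (0.002, 0.001, -0.003)
V2 = tuple(a + b for a, b in zip(V1, dV))
O2 = tuple(a + b for a, b in zip(O1, dO))
x_foe, y_foe = dV[0] / dV[2], dV[1] / dV[2]   # focus of expansion

x, y, Z = 0.3, 0.4, 20.0                      # one feature point
v1 = image_flow(x, y, Z, V1, O1)
v2 = image_flow(x, y, Z, V2, O2)
dwx, dwy = relative_flow(x, y, v1, v2, dO)

# For an ideal relative flow the vector is parallel to the line joining the
# feature to the focus of expansion, so this cross product vanishes.
cross = dwx * (y - y_foe) - dwy * (x - x_foe)
```

The unknown motion of camera 1 (`V1`, `O1`) drops out of `dwx` and `dwy` entirely; only the known relative motion and the range remain.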


Fig.  2  -  Examples  of  the  relative  flow  fields  for: 

a)  a  complex  scene  with  feature  points  marked, 

b)  a  planar  surface  with  a  grid  of  feature  points, 

c)  a  small  planar  patch  with  few  feature  points. 
For each case the simulated scene is shown, along
with the ideal flows for each set of camera motion
parameters and the divergent relative flow.

3. RANGING TO MOVING OBJECTS

Equations (4) and (5) form the basis of the method
for recovering range to points (on moving objects in the
scene) and planes (surface patches on moving objects). If
it were not for the effects of noise on measured image
velocities, the method would be simple and
straightforward, as will be presented in Section 3.1 below.
The inevitable effects of digitization error and noise have
led us to explore two filtering methods: “radial flow filtering,”
as motivated by the divergent nature of the relative
flow noted in (5), and “second-order flow filtering,” which
stems from the Velocity Functional Method developed by
Waxman and Wohn [12]. These techniques are described,
along with examples, in Sections 3.2 and 3.3, respectively.
Section 3.4 considers the effects of the finite baseline
between cameras at their closest approach.

3.1.  Ranging  to  Points  and  Planes 

Given the measured image velocities of corresponding
features on both image planes (determined when the
two cameras are at closest approach), we form the measured
difference flow values $\Delta\mathbf{v}$. Then, according to the
definition in (4), we can compute the relative flows by
correcting these difference values for the known relative
rotation between cameras,

$$ \Delta w_x = \Delta v_x - \left[ xy\, \Delta\Omega_X - (1 + x^2)\, \Delta\Omega_Y + y\, \Delta\Omega_Z \right], \tag{6a} $$

$$ \Delta w_y = \Delta v_y - \left[ (1 + y^2)\, \Delta\Omega_X - xy\, \Delta\Omega_Y - x\, \Delta\Omega_Z \right]. \tag{6b} $$


According to equations (5a-d), the ideal relative flow
diverges from (or converges to) a known focus located at
$(x_{foe}, y_{foe})$. Thus, we can define the “radial relative flow”
as simply

$$ \Delta w_r \equiv \left[ (\Delta w_x)^2 + (\Delta w_y)^2 \right]^{1/2}. \tag{7a} $$

Then according to (5), we can solve for the range $Z$,

$$ Z(x, y) = \frac{\Delta V_Z}{\Delta w_r(x, y)} \left[ (x - x_{foe})^2 + (y - y_{foe})^2 \right]^{1/2}. \tag{7b} $$


This  result  applies  to  individual  feature  points  in  the 
scene  whose  relative  flow  has  been  derived  from  measured 
image  velocities  on  both  cameras.  All  terms  on  the 
right-hand  side  of  (7b)  are  either  known  or  measured. 
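
As a minimal sketch of this point-ranging step, the function below evaluates (7a) and (7b) for a single feature. The function name and the sample numbers are illustrative assumptions; the input relative flow is taken to be already corrected for relative rotation as in (6).

```python
import math


def range_from_relative_flow(x, y, dw, dV):
    """Range Z at image point (x, y) from the relative flow dw, per (7a, b).

    dw -- relative flow (dw_x, dw_y), already corrected for relative rotation
    dV -- known relative translation between cameras (dV_X, dV_Y, dV_Z)
    """
    dVX, dVY, dVZ = dV
    x_foe, y_foe = dVX / dVZ, dVY / dVZ          # focus of expansion, (5c, d)
    dw_r = math.hypot(dw[0], dw[1])              # radial relative flow, (7a)
    radius = math.hypot(x - x_foe, y - y_foe)    # image distance from the focus
    # abs() is an assumption made here so that the range comes out positive
    # whichever way the cameras separate along the line of sight.
    return abs(dVZ) * radius / dw_r


Z = range_from_relative_flow(0.3, 0.4, (-0.002, 0.0065), (0.1, -0.05, 0.2))
```

With these inputs the recovered range is 20 in the units of the relative translation. A feature lying exactly at the focus of expansion has zero relative flow, so the ratio is undefined there and such a point would have to be skipped.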

If a number of feature points are believed to be the
images of points on a planar surface in space, then their
individual range values $Z(x, y)$ can be used to fit a planar
surface. Alternatively, the parameters of the surface can
be obtained collectively from (7b). A planar surface in
space, $Z = Z_0 + pX + qY$, can be written exactly in
image coordinates as $Z = Z_0 (1 - px - qy)^{-1}$. Inverting
(7b) then yields

$$ \frac{\Delta w_r(x, y)}{\Delta V_Z} = \frac{1}{Z_0}\, (1 - px - qy) \left[ (x - x_{foe})^2 + (y - y_{foe})^2 \right]^{1/2}. \tag{7c} $$

This equation can serve as the basis for a linear least-squares
approach to determine the parameters of the
plane, $Z_0^{-1}$, $p/Z_0$ and $q/Z_0$.
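
One way such a linear least-squares fit might be set up is sketched below. Dividing the radial relative flow by the relative translation along the line of sight and by the image distance from the focus gives a sample of $1/Z$ at each feature, which is linear in the unknowns $1/Z_0$, $p/Z_0$ and $q/Z_0$. The normal-equation solver and all names are illustrative assumptions, written with the standard library only.

```python
def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][c] * u[c] for c in range(r + 1, n))) / M[r][r]
    return u


def fit_plane(points, inv_ranges):
    """Least-squares fit of 1/Z = a - b*x - c*y; returns (Z0, p, q).

    inv_ranges[i] is the measured 1/Z at points[i], i.e. the radial relative
    flow divided by (dV_Z times the image distance from the focus).
    """
    rows = [(1.0, -x, -y) for x, y in points]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * s for r, s in zip(rows, inv_ranges)) for i in range(3)]
    a, b, c = solve(AtA, Atb)        # a = 1/Z0, b = p/Z0, c = q/Z0
    return 1.0 / a, b / a, c / a


# Noise-free synthetic samples from a plane with Z0 = 100, p = 0.1, q = 0.2.
points = [(x, y) for x in (-0.2, 0.0, 0.2) for y in (-0.2, 0.0, 0.2)]
inv_ranges = [(1 - 0.1 * x - 0.2 * y) / 100.0 for x, y in points]
Z0_fit, p_fit, q_fit = fit_plane(points, inv_ranges)
```

In the noise-free case the fit returns the generating parameters; with noisy samples the slopes degrade much faster than the scale factor, as discussed in Section 3.2.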

3.2.  Radial  Flow  Filtering 

In the absence of noise and digitization effects, equations
(7) are exact and yield perfect results, putting aside
for the moment the issue of finite separation between
cameras at all times. In our simulation, we are able to
explore the effects of noise by perturbing the individual
flow values before forming the difference flow. We considered
a uniform distribution of noise, up to a specified
percentage, superposed on the individual components of
image velocity.

The effects of noise on estimated image velocities $\mathbf{v}$
are amplified by the differencing procedure in obtaining
$\Delta\mathbf{v}$. Since the order of magnitude of $\mathbf{v}$ is $|\mathbf{V}|/Z$, while
that of $\Delta\mathbf{v}$ is $|\Delta\mathbf{V}|/Z$, the noise effects are amplified by
the ratio $|\mathbf{V}|/|\Delta\mathbf{V}|$ when referred to $\Delta\mathbf{v}$. Clearly, it is
important to have as large a relative velocity between
cameras $\Delta\mathbf{V}$ as is possible. This translates over time to
building up as large a separation between cameras as is
practical.

One  simple  thing  that  can  be  done  to  reduce  the 
effects  of  noise  is  to  perform  a  "radial  filtering"  on  the 
relative  flow.  This  notion  stems  from  the  divergent 
nature  of  the  ideal  relative  flow  discussed  in  Section  2. 
As  definition  (7a)  is  meant  to  imply  that  the  ideal  relative 
flow  should  consist  of  vectors  emanating  radially  from  a 
known  focus  of  expansion,  we  can  impose  this  constraint 
on the relative flow derived from noisy velocity measurements.
That is, we consider only that component of the
relative  flow  which  points  radially  from  a  known  focus. 
The  orthogonal  component  (or  “azimuthal  relative  flow”) 
can  be  used  to  ascertain  the  magnitude  of  the  noise;  it 
vanishes  in  the  limit  of  ideal  velocity  measurements. 
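
The radial filtering step amounts to a projection, which might be sketched as follows. The function name, the sample point and the injected perturbation are illustrative assumptions; the focus of expansion is taken as known from the relative translation between cameras.

```python
import math


def radial_filter(x, y, dw, foe):
    """Split the relative flow dw at (x, y) into radial and azimuthal parts.

    Returns the radially filtered flow vector and the signed azimuthal
    component, which serves as an indicator of the noise magnitude.
    """
    rx, ry = x - foe[0], y - foe[1]
    r = math.hypot(rx, ry)
    ux, uy = rx / r, ry / r                      # unit vector away from focus
    radial = dw[0] * ux + dw[1] * uy             # component along the radius
    azimuthal = -dw[0] * uy + dw[1] * ux         # component orthogonal to it
    return (radial * ux, radial * uy), azimuthal


# Ideal relative flow at (0.3, 0.4) with focus (0.5, -0.25), plus a purely
# azimuthal perturbation of size 0.001 standing in for measurement noise.
foe = (0.5, -0.25)
ideal = (-0.002, 0.0065)
r = math.hypot(0.3 - foe[0], 0.4 - foe[1])
ux, uy = (0.3 - foe[0]) / r, (0.4 - foe[1]) / r
noisy = (ideal[0] - 0.001 * uy, ideal[1] + 0.001 * ux)
filtered, azimuthal = radial_filter(0.3, 0.4, noisy, foe)
```

In this constructed case the filter removes the perturbation completely. Real noise also has a radial part, which this projection cannot remove.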

Figure 3 illustrates the simulated effects of noise
(10%) on the image velocities and the relative flow which
results, in the case of a planar surface. By using only the
radial component of the relative flow, one essentially
reduces the noise by a factor of two. As one might
expect, the errors in range should scale like
(% noise) × $|\mathbf{V}|/|\Delta\mathbf{V}|$. The results of our simulations
bear this out. For a ratio $|\Delta\mathbf{V}|/|\mathbf{V}|$ of about
1/10, reasonable range estimates could be found only
when the noise imposed was less than a few percent. In
the case of planar surfaces, the same is true for recovery
of the scale factor $Z_0$. The slopes of the surface, which
describe the differential changes in range, are extremely
sensitive to noise, as might be expected. Their determination
requires noise below one percent. In practice, one
could not expect surface slope recovery from dynamic
stereo; the individual image flows seem more appropriate
for this task [10, 12]. However, qualitative recovery of
range from relative flows does appear feasible. For the
case of ranging to surfaces, one can do better still, by
using the following filtering technique.


Fig. 3 - Effects of 10% noise on feature point velocities
for the planar surface example (value found for
$Z_0$ is indicated in each case):

a) the relative flow obtained directly from
the noisy velocities,

b) the relative flow after “radial flow filtering”,

c) the relative flow obtained from
“second-order flow filtering”,

d) the ideal relative flow.

3.3.  Second-Order  Flow  Filtering 

The drawback of “radial flow filtering” is that it
operates on the relative flow rather than on the individual
flows preceding the differencing operation. As this
differencing procedure tends to amplify noise, the radial
filtering is not sufficient to recover entirely from the process.
What is needed is a filtering process which reduces
the noise effects on the individual flows, before their
difference is taken. When the image velocities utilized
arise from features on objects at various depths moving
with different space motions, no smoothing of the individual
flows can be performed as the image velocities themselves
are unrelated. However, when features (points or
contours) arising from single objects are isolated, their
image velocities constitute a locally second-order flow
field [11, 12, 13], and this can be used to filter out the
effects of noise on the individual image flows, before
differencing.

The representation of image flows in terms of
second-order flow fields has been termed the Velocity
Functional Method by Waxman and Wohn [12]. It is a
globally valid representation in the case of planar surfaces
[12], and a locally valid one for curved surfaces [13]. The
coefficients of the second-order polynomials are simply
related to the derivatives of the flows via their truncated
Taylor series about a local origin. Let $\mathbf{v}^{(k)}(x, y)$ be the
flow in image $k$ ($k = 1, 2$), and $\mathbf{v}^{(k)}_{(i,j)}$ be its $i$-th partial
derivative with respect to $x$ and $j$-th partial with respect to
$y$ (where $i + j \le 2$), evaluated at a local origin in an
image neighborhood. We have, locally,

$$ \mathbf{v}^{(k)}(x, y) = \sum_{i + j \le 2} \mathbf{v}^{(k)}_{(i,j)}\, \frac{x^i y^j}{i!\, j!}. \tag{8} $$

The coefficients are obtained from a linear least-squares
fit to measured velocities in an image neighborhood. For
point features, both components of image velocity are
available, and (8) then provides two constraints. In the
case of contours evolving in geometry over time, only a
normal velocity (perpendicular to the contour) is perceptible,
and (8) provides only one constraint in the form of a
linear combination of the two velocity components (cf.
[12] for details). When only evolving contours are used, a
minimum structure is required of a single contour if it is
to yield all twelve coefficients in (8). It is generally more
robust to use several contours in a given neighborhood
[12]. The flow field, as reconstructed from (8), is typically
much cleaner than the measured image velocities themselves,
and is therefore more suitable for forming the
difference flow.
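
For point features, the fit described above might be sketched as follows. This is not the authors' implementation of the Velocity Functional Method; it is a minimal standard-library illustration that fits one velocity component to a second-order Taylor polynomial by least squares. Fitting both components in this way gives the twelve coefficients per image. All names and the synthetic plane are assumptions made for the example.

```python
def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][c] * u[c] for c in range(r + 1, n))) / M[r][r]
    return u


def basis(x, y):
    """Taylor basis up to second order, so coefficients are flow derivatives."""
    return [1.0, x, y, 0.5 * x * x, x * y, 0.5 * y * y]


def fit_second_order(points, values):
    """Least-squares coefficients of one velocity component over a neighborhood."""
    rows = [basis(x, y) for x, y in points]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    Atb = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(6)]
    return solve(AtA, Atb)


def evaluate(coeffs, x, y):
    """Reconstructed (filtered) velocity component at (x, y)."""
    return sum(c * b for c, b in zip(coeffs, basis(x, y)))


def planar_flow_x(x, y):
    """x-velocity from (1a) for the plane 1/Z = (1 - p x - q y) / Z0."""
    Z0, p, q = 100.0, 0.1, 0.2
    VX, VZ = 1.0, 2.0
    OX, OY, OZ = 0.01, -0.02, 0.03
    inv_Z = (1 - p * x - q * y) / Z0
    return (x * VZ - VX) * inv_Z + (x * y * OX - (1 + x * x) * OY + y * OZ)


grid = [(-0.2 + 0.1 * i, -0.2 + 0.1 * j) for i in range(5) for j in range(5)]
coeffs = fit_second_order(grid, [planar_flow_x(x, y) for x, y in grid])
residual = abs(evaluate(coeffs, 0.05, -0.15) - planar_flow_x(0.05, -0.15))
```

Because the flow of a planar surface is exactly second order, the noise-free fit reproduces the flow at points that were not in the sample. With noisy samples the same fit acts as the smoothing step applied to each image before differencing.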

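With expressions in the form of (8) for corresponding
neighborhoods in the two images, we can form the
difference flow simply by subtracting the coefficients of
the respective representations. Then, correcting for the
relative rotations as in (6), we have for the components of
relative flow

$$ \Delta w_x = \sum_{i + j \le 2} \left[ v^{(2)}_{x(i,j)} - v^{(1)}_{x(i,j)} \right] \frac{x^i y^j}{i!\, j!} - \left[ xy\, \Delta\Omega_X - (1 + x^2)\, \Delta\Omega_Y + y\, \Delta\Omega_Z \right], \tag{9a} $$

$$ \Delta w_y = \sum_{i + j \le 2} \left[ v^{(2)}_{y(i,j)} - v^{(1)}_{y(i,j)} \right] \frac{x^i y^j}{i!\, j!} - \left[ (1 + y^2)\, \Delta\Omega_X - xy\, \Delta\Omega_Y - x\, \Delta\Omega_Z \right]. \tag{9b} $$

Equations (9) for the relative flow are, themselves,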
second-order  polynomials  in  the  image  coordinates.  They 
can  be  equated  term-for-term  with  the  expanded  form  of 
equations (5), written for a surface $Z(x, y)$. In the case of
a planar surface, equations (5) yield

$$ \Delta w_x = \frac{1}{Z_0} \left[ -\Delta V_X + (p\, \Delta V_X + \Delta V_Z)\, x + q\, \Delta V_X\, y - p\, \Delta V_Z\, x^2 - q\, \Delta V_Z\, xy \right], \tag{10a} $$

$$ \Delta w_y = \frac{1}{Z_0} \left[ -\Delta V_Y + p\, \Delta V_Y\, x + (q\, \Delta V_Y + \Delta V_Z)\, y - p\, \Delta V_Z\, xy - q\, \Delta V_Z\, y^2 \right]. \tag{10b} $$

By comparing the coefficients of (10) with those determined
in (9), one sees that the parameters of the plane
$(Z_0, p, q)$ can be determined from the zeroth-order and
first-order terms directly. Non-planar surfaces will lead
to modifications of the second-order terms (and generate
higher-order terms as well). As one would expect the
zeroth-order terms of (9) to be the most precise, it should
be clear from (10) that $Z_0$ can be recovered most accurately
if $\Delta V_X$ and $\Delta V_Y$ are non-zero. This is borne out by
our simulations, in which reasonable results for $Z_0$ could
be obtained even at 10% noise levels, with
$|\Delta\mathbf{V}|/|\mathbf{V}| \approx 0.1$.
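
The coefficient identification can be illustrated directly. Substituting the planar form of $Z$ into (5a) gives zeroth- and first-order coefficients of the $x$-component equal to $-\Delta V_X/Z_0$, $(p\,\Delta V_X + \Delta V_Z)/Z_0$ and $q\,\Delta V_X/Z_0$. The sketch below inverts these three relations; the function name and the numbers are illustrative assumptions, and it requires the relative translation to have a non-zero $X$ component.

```python
def plane_from_coefficients(c00, c10, c01, dV):
    """Recover (Z0, p, q) from the low-order coefficients of the x-component
    of the relative flow, where
        c00 = -dV_X / Z0
        c10 = (p * dV_X + dV_Z) / Z0
        c01 = q * dV_X / Z0
    and dV = (dV_X, dV_Y, dV_Z) is the known relative translation.
    """
    dVX, _, dVZ = dV
    Z0 = -dVX / c00                  # zeroth-order term gives the scale factor
    p = (c10 * Z0 - dVZ) / dVX       # first-order term in x gives slope p
    q = c01 * Z0 / dVX               # first-order term in y gives slope q
    return Z0, p, q


# Round trip with illustrative values Z0 = 100, p = 0.1, q = 0.2.
dV = (0.1, -0.05, 0.2)
c00 = -dV[0] / 100.0
c10 = (0.1 * dV[0] + dV[2]) / 100.0
c01 = 0.2 * dV[0] / 100.0
Z0_rec, p_rec, q_rec = plane_from_coefficients(c00, c10, c01, dV)
```

The scale factor depends only on the zeroth-order coefficient, whereas the slopes rely on first-order coefficients, which is consistent with the slopes being the more noise-sensitive quantities.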

In our simulations, once the coefficients in (9) have
been determined, we have found it more convenient to
return to (7) and solve for the parameters of the plane by
a least-squares procedure (having sampled (9) over a
sparse grid), rather than literally identify coefficients with
equations (10). In general, it is best to have all components
of $\Delta\mathbf{V}$ non-zero. By filtering the individual flows
before differencing, one is able to recover surface scale
factors $Z_0$ even in the presence of noisy image velocities.
Figure 3 also shows the relative flow for a plane after
second-order filtering; it is clearly a divergent flow. Figure
4 illustrates the case of two elliptic contours on a
planar surface, with their individual normal flows, one of
the full flows recovered by the Velocity Functional
Method, and the relative flow from which $Z_0$ is recovered.
This same example, with normal velocities perturbed by
50%, is shown in Figure 5.

In our simulations we have tried to ascertain the
effects of noise and field of view on recovery of $Z_0$. In
general, when utilizing second-order flow filtering, we
have found recovery of $Z_0$ from point feature velocities to
be possible at 10% noise levels, while utilizing evolving
contours allows recovery of $Z_0$ even with 50% perturbations
to the normal flow around the contour. Moreover,
ranging is possible down to fields of view of about 3°,
below which variations in image velocity become so small
that the method becomes unstable at noise levels exceeding
about 5%. Typically, surface slopes cannot be reliably
determined from dynamic stereo in the presence of
noise; however, they can be found from the analysis of
single image flows [10, 12] (and then utilized with
dynamic stereo to recover $Z_0$ somewhat more accurately).
Finally, we attempted to recover $Z_0$ for curved surfaces,
treating them as if they were planar. As one might
expect, when variation in range over a surface is small
compared to absolute range to that surface, the method
succeeds in recovering the scale factor $Z_0$. The slopes
obtained can be used to find the approximate distance to
individual feature points on the curved surface. However,
when range variations are large (as compared to a planar
surface at comparable $Z_0$), then the recovery of $Z_0$ can be
rather sensitive to noise. This result would favor small
fields of view, where substantial range variation is
unlikely, although too small a field of view (below about
3°) leads to noise sensitivity as well.


Fig. 4 - The use of “normal flow” along contours on a
planar surface $Z = Z_0 + 0.1\,X + 0.2\,Y$:

a & b) the ideal normal flows for each set of
camera motion parameters,

c) the full flow for (a) recovered by the
Velocity Functional Method,

d) the relative flow found from the recovered
second-order flows.


Fig.  5  -  Same  as  Fig.  4,  but  with  50%  noise  superposed 
on  the  normal  flows. 


3.4.  Effects  of  the  Finite  Baseline 

The underlying basis of Dynamic Stereo is the comparison
of image flows obtained from two cameras in
known relative motion. This notion is implicit in the
“relative flow” discussed in Section 2. But in forming the
required “difference flow”, we treated the translation and
rotation matrices of the two cameras as identical, which
is equivalent to treating the two image planes as coincident
at the time of closest approach between cameras
(see the comments preceding equation (3)). Since the two
cameras must be physically separated by some small but
finite baseline, even at their closest approach, it is really
only a simplification to treat the two image planes as
momentarily registered. That is, the image coordinates of
a given feature are not exactly the same on the two
image planes. The remaining disparity between coordinates
is a function of the distance to the feature, exactly
as in the case of “static stereo”. However, this disparity
should not be a real problem, since keeping this baseline
small will lead to an insignificant disparity, except for
objects which are extremely close.
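
The size of this residual disparity is easy to estimate for a baseline parallel to the image plane. The sketch below assumes the unit-focal-length projection $x = X/Z$ used throughout; the function names and the sample numbers are illustrative only.

```python
def image_x(X, Z):
    """Perspective projection onto an image plane at unit focal length."""
    return X / Z


def residual_disparity(baseline, Z):
    """Shift in image coordinate caused by a lateral baseline at range Z."""
    return baseline / Z


# A feature at lateral position X = 3 and range Z = 500, seen by two cameras
# separated by a baseline of 0.5 along X (baseline-to-range ratio 1/1000).
shift = image_x(3.0, 500.0) - image_x(3.0 - 0.5, 500.0)
predicted = residual_disparity(0.5, 500.0)
```

Both values equal 0.001 in image units, i.e. the disparity is simply the baseline-to-range ratio, which is why it only becomes significant for objects that are very close.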

In theory, we can account for the finite baseline at
closest approach by incorporating a small correction to
the image coordinates of features imaged by the second
camera. This correction, being the range-dependent
disparity, can be treated as a small perturbation to the
ranging formulas derived above. This would lead to a
“successive approximation” scheme for recovering $Z_0$, in
which the theory developed here would serve as the
lowest-order approximation, to which corrections can be
applied. However, given the inaccuracies associated with
noisy flow values, such an iteration scheme hardly seems
warranted.

We have investigated the effects of a finite baseline
at closest approach with our Image Flow Simulator [14].
An artificial scene is constructed and motion parameters
are assigned to an observer, yielding the first image flow
field (obtained from second-order flow filtering of point
feature velocities). The second flow field is obtained
using the same scene shifted by a small amount in a
direction parallel to the image plane (in order to simulate
camera separation), and different motion parameters are
assigned to the observer. The two flow fields are then
differenced and corrected for relative rotation in order to
recover range. In the case of a planar surface patch, in
the absence of noise, a baseline-to-range ratio of 1/1000
led to a 2% range error. A ratio of 5/1000 yields about a
10% error, which is comparable to the accuracy obtainable
with 10% noise. It would seem that this ratio is
reasonable when ranging to objects at several hundred
feet or more.


4.  CONCLUDING  REMARKS 

We have introduced a new concept in passive ranging,
termed Dynamic Stereo, which is based on the comparison
of image flow fields obtained from two cameras in
known relative motion. The method is designed for ranging
to moving objects in an evolving scene. For stationary
scenes, this technique reduces to conventional “motion
stereo”. In this paper we have developed the basic theory
and studied the effects of noisy image velocities and field
of view, with the aid of an Image Flow Simulator [14].
These simulations suggest that ranging to points, planes
and curved surfaces is feasible, even in the presence of
10% noise. Two methods for filtering the noise were also
introduced: “Radial Flow Filtering” of the corrected
difference flow and “Second-Order Flow Filtering” prior
to differencing. The second method is preferred, but may
only be applied to image velocities generated by a single
surface patch.

It is anticipated that Dynamic Stereo may have a
number of interesting applications. First, it may be used
in conjunction with single image flow analysis, providing
the scale factor required for the complete recovery of
object structure and space motion from time-varying
imagery [11]. Second, it may prove useful in industrial
robotics for handling moving workpieces in an evolving
workplace. Two cameras would have to be configured on
different parts of the robot arm so that they will experience
a relative motion. Third, Dynamic Stereo has potential
application to autonomous vehicle navigation for
ranging to other moving objects in the scene. Possible
configurations are two cameras in relative motion on a
land vehicle, or one camera on each of two aircraft in
known relative motion.

To appreciate the distance scales involved with this
approach to passive ranging, we can try to scale up our
simulation examples to the case of land vehicles. A vehicle
traveling at 50 km/hr moves about 50 feet in one
second. Upon the vehicle there is mounted a fixed camera
and a sliding camera which moves about 5 feet during the
one-second interval. The scale factors recovered would
then correspond to a distance of the order of 500 feet.
Our simulations indicate that best results are obtained
when the sliding camera moves at an angle with respect
to the direction of vehicle motion. Implementation of
this new ranging technique on actual evolving scenes,
utilizing cameras in relative motion, is planned for the
near future.


Abstract

The 3D Mosaic system is a vision system that incrementally reconstructs
complex 3D scenes from multiple images. The system encompasses several
levels of the vision process, starting with images and ending with symbolic
scene descriptions. This paper describes the various components of the
system, including stereo analysis, monocular analysis, and constructing and
modifying the scene model. In addition, the representation of the scene
model is described. This model is intended for tasks such as matching,
display generation, planning paths through the scene, and making other
decisions about the scene environment. Examples showing how the system
is used to interpret complex aerial photographs of urban scenes are
presented.

Each view of the scene, which may be either a single image or a stereo
pair, undergoes analysis which results in a 3D wire-frame description that
represents portions of edges and vertices of objects. The model is a
surface-based description constructed from the wire frames. With each successive
view, the model is incrementally updated and gradually becomes more
accurate and complete. Task-specific knowledge, involving block-shaped
objects in an urban scene, is used to extract the wire frames and construct
and update the model.


1.  Introduction 

It is important for a general vision system to derive three-dimensional
(3D) information about a given scene from images and store the
information in a coherent manner so that it can be used for various
matching, planning, and display tasks. Our goal in developing the 3D
Mosaic system has been to build a full vision system, that is, one that
goes all the way from images to symbolic 3D descriptions. Further, we
wanted to investigate this process in the context of complex scenes. The
result is really a first pass at such a system, and provides us with a better
understanding of the components required. This paper describes the
system and presents examples of how it is used to interpret complex
aerial photographs of urban scenes.

2.  The  3D  MOSAIC  System 

The goal of the 3D Mosaic system is to obtain an understanding of the
3D configuration of surfaces and objects in a scene. The significance of
this goal may be demonstrated by the following tasks.

1. Model-based image interpretation. A known 3D scene model
can provide significant aid in interpreting arbitrary images of
the scene [7, 19, 24]. The 3D Mosaic system performs the
task of acquiring such a model of the scene.

2. 3D change detection. Change detection is a task that
determines how the geometry and structure of a scene
changes over time. The conventional approach to this task
involves comparing and detecting changes in images.
However, because of different viewpoints and lighting
conditions, changes in the images do not necessarily
correspond to changes in the geometry and structure of the
scene. If 3D scene descriptions were obtained from the
images first, such descriptions could be compared in 3D to
determine changes in the scene.

3. Simulating the appearance of the scene. If a 3D description of
the scene were to be obtained, displays as seen from arbitrary
viewpoints could be generated from it. This is useful for
tasks such as familiarizing personnel with a given area, and
flight planning by generating the scene appearance along
hypothetical flight paths.

4. Robot navigation. Three-dimensional descriptions of
complex environments may be used to make decisions
dealing with path planning or determining which parts of the
environment to analyze in more detail.

The 3D Mosaic system deals with complex, real-world scenes (e.g., Fig.
4). That is, the scenes contain many objects with a variety of shapes, the
object surfaces have a variety of textures and reflectance characteristics,
and the scenes are imaged under outdoor lighting conditions. Because of
the complexity, there are many difficulties in interpreting the images,
including:

1. Any particular image contains only partial information about
the scene because many surfaces are occluded.

2. Even portions of the scene that are visible are often difficult
to recover. For example, surfaces with dark shadows cast
across them, or with highlights, may be difficult to interpret.
Highly oblique surfaces may be difficult to analyze if their
resolution in the image is poor.

Our approach to the problems of complexity is to use multiple images
obtained from multiple viewpoints. This approach aids interpretation in
two ways. First, surfaces occluded in one image may become visible in
another. Second, features of surfaces that are difficult to analyze and
interpret in one image (such as scene edges and texture) may become
more apparent in another image because of different viewpoint and/or
lighting conditions.

2.1.  Incremental  Approach 

A large number of views will, in general, be required to obtain a fully
accurate and complete description of a complex scene. Typically, all
these views will not be simultaneously available, while some may never
become available. Many of them will only be obtained gradually
through interaction with the scene environment. Our system must
therefore have the ability to utilize partial descriptions and incrementally
update them with new information whenever a new view happens to
become available. As a practical example, consider a robot (perhaps a
mobile ground robot or an automatically guided airplane) which is
attempting to navigate through an unknown environment. The robot
would sequentially acquire images of the environment as it moves about.
Information derived from each new image would serve to update its
internal model, and this partial model would be used to decide where to
go next, or where to analyze in more detail.

We have adopted an approach in which the 3D scene model is
incrementally acquired over the multiple views. The views of the scene
are sequentially acquired and processed. Partial 3D information is
derived from each view. The initial model is constructed from 3D
information obtained from the first view, and represents an initial
approximation of the scene. As each successive view is processed, the
model is incrementally updated and gradually becomes more accurate
and complete.

Most  previous  research  efforts  at  acquiring  3D  scene  descriptions  from 
multiple  views  have  dealt  with  relatively  simple  scenes  in  controlled 
environments [2, 8, 9, 18, 22, 25]. This has led, in some cases, to only
utilizing occluding contours in the image to form the 3D description
[2, 8, 9, 18]. The work of Moravec [20] deals with complex indoor and
outdoor  scenes,  but  the  3D  descriptions  generated  by  his  system  consist 
of  sparse  sets  of  feature  points.  Our  system,  on  the  other  hand,  generates 
full,  surface-based  descriptions. 

2.2.  Overview 

A  flowchart  for  the  3D  Mosaic  system,  showing  the  major  modules 
and data structures, is displayed in Fig. 1. The input is a new view of the
scene, which may be either a stereo image pair or a single image. The
stereo pair undergoes stereo analysis, while the single image undergoes
monocular analysis. The purpose of these analyses is to obtain 3D scene
features such as portions of surfaces, edges, and corners.

The central scene model is a surface-based description which is
constructed and modified from these features. Before modifications to
the scene model can occur, the 3D features from the new view must be
matched to the current model. The scene model may, at any point along
its  development,  be  used  for  tasks  such  as  image  interpretation,  planning, 
or  display  generation.  A  new  view  may  then  be  acquired  which  may 
further  modify  the  model. 

For example, when the stereo analysis component is applied to the
images in Fig. 4, the result is the set of wire frames in Fig. 9. The scene
model constructed from these wire frames is shown in Fig. 20. When the
monocular analysis component is applied to the image in Fig. 10, the
result is the set of wire frames in Fig. 17. These, in turn, are converted
into the scene model in Fig. 21. Finally, the result of modifying the
model in Fig. 20 with a new view is shown in Fig. 27.

3.  Stereo  Analysis 

Most  stereo  matching  methods  involve  matching  low-level  image 
features, such as image intensities [3, 13, 17, 21] or image edge points
[3, 12, 21]. Points to be matched may also be chosen as "interesting
points", e.g., those with high variance in all directions [6, 20]. Our
method involves matching structural features - i.e., junctions - extracted
from the images. There are several reasons for this.


Figure 1: 3D Mosaic flowchart. The dashed lines represent
components that have not yet been implemented; the solid lines
represent components already implemented.


First, feature-based matching results in more accurate 3D positions for
occlusion boundaries than gray scale area matching. Second, by
extracting 3D information dealing with scene vertices and edges
emanating from them, we obtain portions of boundaries of scene
buildings, particularly building corners. These boundaries are then used
to construct 3D approximations of the buildings.

Finally,  because  of  our  wide-angle  stereo  images,  there  are  large 
disparity jumps and large portions of the scene are visible in one image
but not the other. Because most stereo systems do not distinguish these
from other regions of the image, they try to find matches for them and
therefore have trouble [3, 5, 6, 12, 13, 17].

In  our  approach,  rather  than  attempting  to  find  matches  for  scene  faces 
occluded  in  one  of  the  images,  we  match  face  boundaries  visible  in  both 
images.  We  do  this  by  explicitly  taking  into  account  the  way  junction 
appearances  change  from  one  image  to  the  other,  using  the  knowledge 
that in urban scenes, roofs of buildings tend to be parallel to the ground
plane, while walls tend to be perpendicular to this plane. Edges in the
scene perpendicular to the ground will appear in each image to be
directed towards the vertical vanishing point [16].

If a feature in an image lies on a roof, its appearance in the other image
as a function of position along the epipolar line can be predicted if the
normal to the ground plane is known. To see why, consider Fig. 2.
Suppose the junction P1P2P3 in image1 is given, and our goal is to
predict the junction Q1Q2Q3 in image2, where the point Q1 lies
anywhere (inside the infinity point) on the epipolar line corresponding to
P1. For the position Q1, the 3-space position of V1 can be computed as
the intersection of the rays through P1 and Q1. This uniquely determines
the position of the plane parallel to the ground that contains V1. The
3-space positions of the points V2 and V3 can now be computed as the
intersections of this plane with the rays corresponding to the points P2
and P3, respectively. Finally, the points Q2 and Q3 are uniquely
determined as central projections of the points V2 and V3, respectively.
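
The prediction step just described can be sketched in a few lines. This is
only an illustration under simplifying assumptions that are not in the
paper: both cameras are ideal pinhole cameras looking straight down with
axes aligned to the world, and the names Camera, intersect_rays, ray_plane,
and predict_junction are hypothetical.

```python
from typing import Tuple

Vec = Tuple[float, float, float]
Pix = Tuple[float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def along(origin: Vec, direction: Vec, t: float) -> Vec:
    return (origin[0] + t * direction[0],
            origin[1] + t * direction[1],
            origin[2] + t * direction[2])


class Camera:
    """Assumed model: downward-looking pinhole camera, axes aligned to the world."""

    def __init__(self, center: Vec, focal: float) -> None:
        self.center = center
        self.focal = focal

    def ray(self, p: Pix) -> Vec:
        # Direction of the viewing ray through image point p.
        return (p[0], p[1], -self.focal)

    def project(self, x: Vec) -> Pix:
        # Central projection of a scene point onto the image plane.
        d = sub(x, self.center)
        return (-self.focal * d[0] / d[2], -self.focal * d[1] / d[2])


def intersect_rays(c1: Vec, r1: Vec, c2: Vec, r2: Vec) -> Vec:
    """Midpoint of the shortest segment joining two rays (they rarely meet exactly)."""
    w = sub(c1, c2)
    aa, ab, bb = dot(r1, r1), dot(r1, r2), dot(r2, r2)
    aw, bw = dot(r1, w), dot(r2, w)
    denom = aa * bb - ab * ab  # zero only when the rays are parallel
    s = (ab * bw - bb * aw) / denom
    t = (aa * bw - ab * aw) / denom
    x1, x2 = along(c1, r1, s), along(c2, r2, t)
    return ((x1[0] + x2[0]) / 2, (x1[1] + x2[1]) / 2, (x1[2] + x2[2]) / 2)


def ray_plane(c: Vec, r: Vec, normal: Vec, offset: float) -> Vec:
    """Point on the ray c + t*r that satisfies normal . x = offset."""
    t = (offset - dot(normal, c)) / dot(normal, r)
    return along(c, r, t)


def predict_junction(cam1: Camera, cam2: Camera, p1: Pix, p2: Pix, p3: Pix,
                     q1: Pix, up: Vec) -> Tuple[Pix, Pix]:
    """Predict Q2 and Q3 in image2 for a candidate position Q1 of the junction point."""
    v1 = intersect_rays(cam1.center, cam1.ray(p1), cam2.center, cam2.ray(q1))
    height = dot(up, v1)  # plane parallel to the ground that contains V1
    v2 = ray_plane(cam1.center, cam1.ray(p2), up, height)
    v3 = ray_plane(cam1.center, cam1.ray(p3), up, height)
    return cam2.project(v2), cam2.project(v3)
```

In a matcher, predict_junction would be evaluated for each candidate Q1
along the epipolar line, and the predicted legs compared with lines
actually found near that position.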

Therefore, when an L junction is found in one image, it is initially
assumed to arise from a corner of a roof, and its appearance in the other
image can be predicted. When an ARROW or FORK junction is found,
the leg of the junction directed towards the vertical vanishing point is
initially assumed to arise from a scene edge perpendicular to the ground,
while the other two legs are initially assumed to arise from scene edges
lying on a roof or on the ground. Again its appearance can be predicted.

Figure 2: For junction P1P2P3, its appearance in image2 can be
predicted as a function of position Q1 along the epipolar line. The
normal to plane V1V2V3 must be known.

Figure 3: The vector from the focal point to the vertical vanishing
point is a 3-space vector in the vertical direction.

Structural relationships between scene vertices are also used to aid in
the matching. If two junctions in an image arise from scene vertices at the
same height above the ground, the positions of the corresponding
junctions in the other image, as a function of position along the epipolar
line, can be predicted if the normal to the ground plane is known. This
can be shown using arguments similar to those given before. In Fig. 2,
pretend that the points P, Q, and V correspond to positions of separate
junctions and vertices. For example, if P1 and P3 are two separate
junctions in image1, then for some point Q1 on the epipolar line
corresponding to P1, the position of the junction Q3 corresponding to P3
can be predicted if V1 and V3 are assumed to lie at the same height. We
make the assumption that junctions close to one another in the image
often correspond to vertices lying on top of the same building and
therefore have approximately the same height.

These matching techniques assume that the vector normal to the
ground plane is known. To obtain this vector, we form a vector from the
focal point to the vertical vanishing point. As shown in Fig. 3, this results
in a 3-space vector in the vertical direction [4], since a line containing the
focal point and vertical vanishing point intersects any vertical line at
infinity. The focal length and vertical vanishing point are currently
manually obtained.
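
As a small illustration of how the vertical direction follows from the
vertical vanishing point, the sketch below assumes camera coordinates with
the focal point at the origin and the image plane at z = focal. The function
name and the principal_point parameter are hypothetical, and the sign of the
result (up versus down) depends on conventions the paper does not state.

```python
import math
from typing import Tuple


def vertical_direction(vanishing_point: Tuple[float, float], focal: float,
                       principal_point: Tuple[float, float] = (0.0, 0.0)
                       ) -> Tuple[float, float, float]:
    """Unit vector from the focal point toward the vertical vanishing point."""
    x = vanishing_point[0] - principal_point[0]
    y = vanishing_point[1] - principal_point[1]
    norm = math.sqrt(x * x + y * y + focal * focal)
    return (x / norm, y / norm, focal / norm)
```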


3.1. Steps in Stereo Analysis

We  now  provide  an  example  showing  how  the  stereo  analysis  is 
performed  on  the  stereo  pair  of  images  in  Fig.  4.  First,  linear  features  are 
extracted  by  finding  edge  points,  thinning  and  linking  them,  and  fitting 
piecewise linear segments. The resulting line images are shown in Fig. 5.
Next, junctions are extracted by placing a 5x5 window around each end
point of each line and searching for ends of other lines. Junctions that
have been found are labeled in Fig. 5. (See [15] for more details.) Notice
that  many  of  the  junctions  correspond  to  building  corners. 
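
The window search for junctions might look roughly as follows. This is a
minimal sketch, assuming line segments are given as pairs of integer pixel
end points; find_junctions and its half_window parameter (2 gives a 5x5
window) are hypothetical names, and junction classification (L, A, F, T) is
not attempted here.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Segment = Tuple[Point, Point]


def find_junctions(segments: List[Segment],
                   half_window: int = 2) -> List[Tuple[int, Point, List[int]]]:
    """Return (segment index, end point, indices of other segments ending nearby)."""
    junctions = []
    for i, seg in enumerate(segments):
        for end in seg:
            # Other segments with an end point inside the window centered on `end`.
            near = [
                j for j, other in enumerate(segments)
                if j != i and any(
                    abs(q[0] - end[0]) <= half_window
                    and abs(q[1] - end[1]) <= half_window
                    for q in other
                )
            ]
            if near:
                junctions.append((i, end, near))
    return junctions
```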

We now want to find potential junction matches between the two
images. Let us consider how L junctions are matched. Each L junction
is initially assumed to lie on a horizontal scene plane. The shape and
orientation of its corresponding junction in the other image, as a function
of position along the epipolar line, can therefore be predicted. Each L
junction in the first image may therefore usually be matched with several
junctions in the second image that have, within tolerance, the predicted
shape and orientation. However, we do not try to match only with
junctions in the second image that have been previously found. Rather,
for every point on the epipolar line (on the appropriate side of the
infinity point), a search is made within a pre-specified window for lines
that might correspond to the predicted junction. The requirements for
two lines to form a junction, however, are more relaxed than the
requirements during initial junction search. The matching is performed
in two directions, from the first image to the second, and vice versa.


At  this  point,  each  junction  in  one  image  is  associated  with  a  set  of 
potentially  matching  junctions  in  the  other  image.  The  next  step  is  to 
find  the  best  of  the  potential  matches,  resulting  in  a  single  match  for  each 
junction.  Two  criteria  are  used  in  determining  the  best  matches: 


1. If the image intensities inside two potentially matching
junctions are similar, the likelihood that they really match is
increased. This is because the two junctions will often have
similar intensities if they arise from the same face corner. To
measure the degree of similarity, we compute the average
intensities of regions along the two legs of the L junction in
each image. As depicted in Fig. 6, let A and B be the average
intensities of these regions in one image, and let A' and B' be
the average intensities of corresponding regions in the other
image. Then the degree of similarity, called the local cost, is
defined as

C_{local} = |A - A'| + |B - B'|.

2. As described previously, if two junctions in an image arise
from scene vertices that are at the same height, the relative
positions of the corresponding junctions in the other image,
as a function of position along the epipolar line, can be
predicted. We use this to determine whether two sets of
junction matches are consistent with one another. Suppose,
in Fig. 7, that the junctions J1 and J2 in image1 arise from
scene vertices that are at the same height. Suppose also that
the junction matches (J1, J'1) and (J2, J'2) have been
hypothesized. To measure the degree of consistency between
these two sets of matches, we predict the position of the
junction in image2 that corresponds to (say) J2. Let us refer
to the predicted position as J"2. If the vector from J'1 to J"2
is (a1, b1) and the vector from J'1 to J'2 is (a2, b2), then the
degree of consistency between the two sets of matches, called
the global cost, is defined as

C_{global} = |a_1 - a_2| + |b_1 - b_2|.
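
The two costs defined above translate directly into code. The sketch below
is only illustrative; the function and argument names are hypothetical, and
the average intensities and junction positions are assumed to have been
measured already.

```python
from typing import Tuple

Pix = Tuple[float, float]


def local_cost(a: float, b: float, a_prime: float, b_prime: float) -> float:
    """|A - A'| + |B - B'| for the average intensities along the two legs."""
    return abs(a - a_prime) + abs(b - b_prime)


def global_cost(j1_match: Pix, j2_predicted: Pix, j2_match: Pix) -> float:
    """|a1 - a2| + |b1 - b2| for the vectors J'1 -> J"2 and J'1 -> J'2."""
    a1, b1 = j2_predicted[0] - j1_match[0], j2_predicted[1] - j1_match[1]
    a2, b2 = j2_match[0] - j1_match[0], j2_match[1] - j1_match[1]
    return abs(a1 - a2) + abs(b1 - b2)
```

Both costs are zero for a perfect match and grow as the intensities or the
predicted and actual positions disagree.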


To  arrive  at  a  unique  set  of  junction  matches,  the  space  of  potential 
matches is searched using a beam search [24], which is guided by the
above two criteria. The results of this search are displayed in Fig. 8,
which  shows  junctions  in  one  image  that  have  matches  in  the  other 
image. 
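
The paper does not give the details of its beam search, so the following is
only a generic sketch of the technique under stated assumptions: each
junction has a list of (candidate match, local cost) pairs, a hypothetical
pair_cost function supplies the global cost between two chosen matches, and
the cost of a partial assignment is the plain sum of both kinds of cost.

```python
from typing import Callable, Hashable, List, Tuple

Candidate = Tuple[Hashable, float]  # (candidate match, local cost)


def beam_search(candidates: List[List[Candidate]],
                pair_cost: Callable[[Hashable, Hashable], float],
                beam_width: int = 3) -> Tuple[float, List[Hashable]]:
    """Choose one match per junction, keeping only the cheapest partial sets."""
    beam: List[Tuple[float, List[Hashable]]] = [(0.0, [])]
    for options in candidates:
        expanded = []
        for cost, chosen in beam:
            for match, local in options:
                # Local cost plus global cost against every match chosen so far.
                extra = local + sum(pair_cost(prev, match) for prev in chosen)
                expanded.append((cost + extra, chosen + [match]))
        expanded.sort(key=lambda item: item[0])
        beam = expanded[:beam_width]  # prune to the beam width
    return beam[0]
```

With a beam width of 1 this degenerates into a greedy search, which can
commit to a locally cheap match that is globally inconsistent; a wider beam
trades time for robustness.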

Finally, 3D coordinates of vertices and equations of edges are derived
using triangulation. Fig. 9 shows a perspective view of the 3D vertices
and edges that result. We call this a wire frame description of the scene.

4.  Monocular  Analysis 

Although stereo is a major source of 3D information, some views of the
scene will be only single images. We can also extract 3D information
from these images by exploiting task-specific knowledge. We assume that
the objects in the scene are trihedral polyhedra containing only vertical
and horizontal faces, i.e., faces perpendicular and parallel, respectively,
to the ground plane. Our monocular analysis extracts linear structures in
the image that represent boundaries of buildings, and then converts these
structures into 3D wire frames.

4.1.  Steps  in  Monocular  Analysis 

This section provides an example showing how the monocular analysis
is  performed  on  the  image  in  Fig.  10. 


Figure 4: Gray scale stereo images of a region of Washington, D.C.


Extracting lines and junctions. The first step is to extract linear
segments and junctions from the image. The method used here is the
same as that used during stereo analysis (as previously described). First,
thinned edge points are found, and then lines and junctions are
extracted, as shown in Fig. 11.

Creating 2D structures. Next we form linear connected structures in
the image by hypothesizing new lines to connect the previously extracted
junctions. These connected structures are meant to represent building
boundaries, and the hypothesized lines are meant to correspond to
building edges. The process of hypothesizing connecting lines consists of
two steps. First, two junctions may be connected only if a leg of one
points at the other, that is, the extended leg meets the other junction. For
each pair of junctions that passes this test, a line showing the connection
between the two junctions is drawn in Fig. 12.

The second step involves determining which connections shown in Fig.
12 appear as connections in the line image (Fig. 11). For each pair of
connected junctions Jj and Jk, we find all segments in the line image that
are contained within a thin rectangular window connecting Jj and Jk, and
project these segments onto the line connecting the two junctions. Then
we consider how much of this line is covered by projected segments. The
connection between Jj and Jk is retained only if the percentage of
coverage exceeds a threshold. After this pruning step, the junction legs
originally extracted in the junction finding step are added, and
extraneous legs are deleted. The final connected structures are displayed
in Fig. 13.
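
The coverage test used to prune hypothesized connections can be sketched as
below. The window half-width and the way overlapping projections are merged
are assumptions made for illustration; the paper states only that a thin
rectangular window and a coverage threshold are used. The name coverage is
hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def coverage(j_a: Point, j_b: Point, segments: List[Segment],
             half_width: float = 1.5) -> float:
    """Fraction of the line j_a -> j_b covered by segments inside the window."""
    length = math.hypot(j_b[0] - j_a[0], j_b[1] - j_a[1])
    dx, dy = (j_b[0] - j_a[0]) / length, (j_b[1] - j_a[1]) / length

    def to_window(p: Point) -> Tuple[float, float]:
        # (distance along the connecting line, signed distance across it)
        px, py = p[0] - j_a[0], p[1] - j_a[1]
        return px * dx + py * dy, px * dy - py * dx

    intervals = []
    for a, b in segments:
        (ta, wa), (tb, wb) = to_window(a), to_window(b)
        inside = (abs(wa) <= half_width and abs(wb) <= half_width
                  and 0.0 <= ta <= length and 0.0 <= tb <= length)
        if inside:
            intervals.append((min(ta, tb), max(ta, tb)))

    # Merge overlapping projections so that no stretch is counted twice.
    intervals.sort()
    covered, reach = 0.0, 0.0
    for lo, hi in intervals:
        lo = max(lo, reach)
        if hi > lo:
            covered += hi - lo
            reach = hi
    return covered / length
```

A connection would then be retained when coverage(...) exceeds the chosen
threshold, for example 0.5 (an illustrative value, not one from the paper).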


Figure 5: Resulting linear segments after extracting, thinning, and
linking edge points. Junctions are classified as L, A (arrow), F (fork),
or T.

Figure 6: Intensities of corresponding regions of L junctions in the
two images are used to compute the local matching cost.


Figure 7: Positional vectors of predicted and actual positions of two
junction matches are used to compute the global matching cost.


Figure 8: Matches that have been found for the junctions in Fig. 5.
Actually, not all matches are correct. For example, although the
junction matches (J1, J2) and (J3, J4) are correct, the match (J5, J6) is
incorrect.


Obtaining 3D wire frames. The next step is to convert the 2D structures
into 3D wire frames. First, the lines that form the 2D structures are
labeled as either "vertical" or "horizontal" depending on whether or not
they are directed toward the vertical vanishing point [16]. Next, we use
the position of the vertical vanishing point to calculate the vector in the
vertical direction, as described in an earlier section. Let us now consider
how to recover the 3D configuration of the junction p1p2p3p4 in Fig. 14.

Suppose that line p2p4 has been labeled "vertical" and lines p1p2 and p2p3
have been labeled "horizontal". Let u be the unit vector in the vertical
direction. This vector is normal to all horizontal planes. First we would
like to determine the 3-space position of v2, corresponding to the junction
point p2. Since it is impossible to determine the actual position of this
point from a single image without special information, the position is
determined as some arbitrary point lying on the ray through p2, i.e., the
depth a of v2 is arbitrarily chosen. The horizontal plane v1v2v3 can now be
established, since it contains v2 and its normal vector is u. The 3-space
positions of the points v1 and v3 can then be computed as the
intersections of this plane with the rays through p1 and p3, respectively.
Finally, the 3-space position of the point v4 is computed as the
intersection of the ray through p4 with the line through v2 along the
vector u.
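
A minimal sketch of this single-junction recovery follows. It assumes
camera coordinates with the focal point at the origin and the image plane
at z = focal, treats the depth a as the z-coordinate of v2, and takes v4 as
the point on the vertical line through v2 that lies closest to the ray
through p4 (the two lines need not meet exactly in noisy data). All names
are hypothetical.

```python
from typing import Tuple

Vec = Tuple[float, float, float]
Pix = Tuple[float, float]


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def scale(a: Vec, s: float) -> Vec:
    return (a[0] * s, a[1] * s, a[2] * s)


def ray(p: Pix, focal: float) -> Vec:
    return (p[0], p[1], focal)


def recover_junction(p1: Pix, p2: Pix, p3: Pix, p4: Pix, up: Vec,
                     focal: float, depth: float) -> Tuple[Vec, Vec, Vec, Vec]:
    """Recover v1..v4 for a junction whose legs p2p1, p2p3 are horizontal
    and whose leg p2p4 is vertical, relative to an arbitrary depth for v2."""
    v2 = scale(ray(p2, focal), depth / focal)  # arbitrary point on the ray through p2
    offset = dot(up, v2)                       # horizontal plane: up . x = offset
    r1, r3, r4 = ray(p1, focal), ray(p3, focal), ray(p4, focal)
    v1 = scale(r1, offset / dot(up, r1))       # ray through p1 meets the plane
    v3 = scale(r3, offset / dot(up, r3))       # ray through p3 meets the plane
    # Point on the vertical line v2 + t*up closest to the ray through p4.
    aa, ab, bb = dot(r4, r4), dot(r4, up), dot(up, up)
    aw, bw = -dot(r4, v2), -dot(up, v2)
    t = (aa * bw - ab * aw) / (aa * bb - ab * ab)
    v4 = (v2[0] + t * up[0], v2[1] + t * up[1], v2[2] + t * up[2])
    return v1, v2, v3, v4
```

Scaling the chosen depth scales all four recovered vertices by the same
factor, which is why the configuration is known only relative to that depth.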

Although this technique permits us to recover the 3D configuration of
any junction relative to some arbitrary depth, it is not useful to apply it
directly to the junctions in the original line image (Fig. 11), because the
relative heights above the ground plane of the corresponding vertices
cannot be determined: the height of each vertex is arbitrarily chosen
without relation to the heights of other vertices. It is more useful,
however, to apply the technique to the 2D structures in Fig. 13, since the
heights of the vertices within each structure can be related. To see how
this is done, consider the example in Fig. 15, which shows a 2D structure.
Suppose two lines, one of them p1p6, have been labeled "vertical", while
the other solid lines have been labeled "horizontal". Applying our
technique to (say) point p1, the 3-space positions of the vertices
corresponding to points p1, p2, and p6 can be determined relative to some
arbitrary depth a for p1. If the technique is applied next to point p2, the
3-space position of point p3 can be determined as a function of the depth
a. This procedure continues with points p3, p4, and so on, until the 3D
configuration of the whole structure has been determined, relative to
some arbitrary depth.

In order to obtain a coherent scene description, the depths of the
different structures in the scene must be related. We use two methods to
do this. The first method involves finding structures that lie on the
ground plane. Suppose a junction point p of such a structure is
hypothesized to arise from a vertex lying on the ground. Then the
3-space position of the vertex may be obtained as the intersection of the
ground plane with the ray through p. The normal vector u to the ground
plane is known, but the distance d from the focal point to the ground
plane is arbitrarily chosen. Since the 3-space position of all junctions
arising from ground points can be calculated in this manner, the depths
of all structures containing such points can be related to one another
through the parameter d.
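
The ground-point computation reduces to a single ray-plane intersection.
The sketch below assumes camera coordinates with the focal point at the
origin and the image plane at z = focal, and writes the ground plane as
u . x = d with u pointing from the camera toward the ground; that sign
convention and the name ground_vertex are assumptions.

```python
from typing import Tuple

Vec = Tuple[float, float, float]


def ground_vertex(p: Tuple[float, float], up: Vec, focal: float, d: float) -> Vec:
    """Intersect the ray through image point p with the plane up . x = d."""
    r = (p[0], p[1], focal)
    t = d / (up[0] * r[0] + up[1] * r[1] + up[2] * r[2])
    return (t * r[0], t * r[1], t * r[2])
```

Because every ground vertex is proportional to d, changing d rescales all
structures anchored to the ground by the same factor, which is how their
depths become related to one another.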

To hypothesize junctions that arise from vertices lying on the ground
plane, we use the observation that if a line labeled "vertical" connects
two junctions (e.g., line p1p6 in Fig. 15), the line is directed toward the
vertical vanishing point with respect to one junction, but away from this
vanishing point with respect to the other junction. The latter junction is
assumed to represent a vertex lying on the ground plane. The ground-end
junctions of the two vertical lines in Fig. 15 are examples of such
junctions. The 3-space positions of these junctions are then calculated,
and their values are propagated throughout their structures as described
previously.
Figure 10: Aerial photograph showing part of Washington, D.C. This
is a different view of the same scene as in Fig. 4.


Figure 11: Lines fitted to edge points extracted from Fig. 10.
Junctions in the image are classified as L, A (arrow), F (fork), or T.


Figure 12: Each line represents a possible connection between the
junctions at its two end points. Each end point corresponds to a
junction in Fig. 11.


Figure 13: Result after pruning junction connections in Fig. 12 and
adding junction legs originally extracted in the junction finding step.


Figure 14: The 3D configuration of the junction p1p2p3p4 can be
recovered under assumptions explained in the text.


Figure 15: The solid lines represent a connected 2D structure. The
dashed lines are for the reader's convenience to make the 3D shape
more apparent.


There are many structures in Fig. 13 that do not contain points lying
on the ground plane. Nevertheless, the heights of some of these structures
can be determined using the rule that if two lines are aligned in the
image, they are often aligned in 3-space. Suppose, in Fig. 16, that the
points of the structure on the left have already been assigned 3D
coordinates, and we want to obtain the 3-space position of the 2D
structure on the right. Since a line of the left structure and a line of the
right structure are aligned in the image and both are labeled "horizontal",
they are assumed to be aligned in the scene and to lie in the same
horizontal plane. The 3-space position of (say) an end point of the
aligned line in the right structure is therefore determined as the
intersection of this plane with the ray through that point. The 3D
coordinates of this point may then be propagated to the remaining points
of the structure as described previously. Note that all 3D positions are
functions of the parameter d, which is arbitrarily chosen for the equation
of the ground plane.

Fig. 17 depicts a perspective view of the 3D wire frames obtained using
these methods.


Figure 16: If the 3D configuration of the structure on the left has
been determined, the relative 3D position of the structure on the
right may also be determined because two of their lines are aligned.


Figure 17: Perspective view of 3D wire frames generated from Fig.
13.


5. Representing and Manipulating the 3D Scene Model

The representation we have developed for the 3D scene model draws
on ideas from geometric modelling used in computer-aided design
systems [1, 23]. In these systems, however, the 3D models are usually
derived through interaction with a user. Our case is different in that (1)
the 3D models are derived automatically from 2D images, and (2) many
portions of the scene are unknown or recovered with errors because of
occlusions or unreliable analysis.


The following factors have determined how the scene model is
represented and manipulated.

1.  Partially  complete,  planar-faced  objects  must  be  efficiently 
described  by  the  model.  It  is  therefore  represented  as  a  graph 
in  terms  of  symbolic  primitives  such  as  faces,  edges,  vertices, 
and  their  topology  and  geometry.  Information  is  added  and 
deleted  by  means  of  these  primitives. 

2.  The  model  must  be  easy  to  use  in  matching. 

3. Because scene approximations are often more useful if they
contain reasonable hypotheses for parts of the scene for
which there are partial data, we introduce mechanisms that
permit hypotheses to be generated, added, and deleted.

4. Because incremental modifications to the model must be easy
to perform, we introduce mechanisms to (a) add primitives to
the model in a manner such that constraints on geometry
imposed by these additions are propagated throughout the
model, and (b) modify and delete primitives if discrepancies
arise between newly derived and current information.

The 3D structure in the scene is represented in the form of a graph,
called the structure graph. The nodes and links represent primitive
topological and geometric constraints. The structure graph is
incrementally constructed through the addition of these constraints. As
constraints are accumulated, their effects are propagated to other parts of
the graph so as to obtain globally consistent interpretations.

Nodes in the structure graph represent either primitive topological
elements (i.e., faces, edges, vertices, objects, and edge-groups (which are
rings of edges on faces)) or primitive geometric elements (i.e., planes,
lines, and points). Face, edge, and vertex nodes are tagged as either
confirmed or unconfirmed. Confirmed means that the element
represented by the node has been derived directly from images.
Unconfirmed means that the element has only been hypothesized.

The primitive geometric elements serve to constrain the 3-space
locations of faces, edges, and vertices. Plane and line nodes contain
plane and line equations, respectively. Point nodes contain coordinate
values. The structure graph contains two types of links: the part-of link,
representing the part/whole relation between two topological nodes, and
the geometric constraint link, representing the constraint relation
between a geometric and topological node.
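
One possible way to hold such a graph in memory is sketched below. The
class, field, and method names are hypothetical and chosen for illustration;
the paper specifies only the node kinds, the confirmed/unconfirmed tag, and
the two link types. Constraint propagation is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

TOPOLOGICAL = {"object", "face", "edge-group", "edge", "vertex"}
GEOMETRIC = {"plane", "line", "point"}


@dataclass
class Node:
    name: str
    kind: str
    confirmed: Optional[bool] = None  # meaningful for face, edge, vertex nodes
    values: Tuple[float, ...] = ()    # plane/line equation or point coordinates


@dataclass
class StructureGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    part_of: List[Tuple[str, str]] = field(default_factory=list)      # (part, whole)
    constraints: List[Tuple[str, str]] = field(default_factory=list)  # (geometric, topological)

    def add(self, node: Node) -> Node:
        if node.kind not in TOPOLOGICAL | GEOMETRIC:
            raise ValueError(f"unknown node kind: {node.kind}")
        self.nodes[node.name] = node
        return node

    def link_part(self, part: str, whole: str) -> None:
        # Part-of links join two topological nodes.
        if not {self.nodes[part].kind, self.nodes[whole].kind} <= TOPOLOGICAL:
            raise ValueError("part-of links join topological nodes")
        self.part_of.append((part, whole))

    def constrain(self, geometric: str, topological: str) -> None:
        # Geometric constraint links join a geometric node to a topological node.
        if self.nodes[geometric].kind not in GEOMETRIC:
            raise ValueError("first node must be geometric")
        if self.nodes[topological].kind not in TOPOLOGICAL:
            raise ValueError("second node must be topological")
        self.constraints.append((geometric, topological))

    def parts(self, whole: str) -> List[str]:
        return [p for p, w in self.part_of if w == whole]

    def hypothesized(self) -> List[str]:
        return [n.name for n in self.nodes.values() if n.confirmed is False]
```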

Fig. 18 shows a simple example of a structure graph consisting of two
objects, ob1 and ob2. Arrows with single lines represent part-of links, and
arrows with double lines represent geometric constraint links. The faces,
edge-groups, edges, and vertices are each labeled with their own symbols
(e for edges and v for vertices). The graph shows one point node and one
plane node. Further details on representing and manipulating the 3D scene
model may be found in [14, 15].

6.  Generating  the  3D  Scene  Model 

The result of image analysis is a 3D wire frame description that
represents 3D vertices and edges which correspond to portions of
boundaries of objects in the scene. We construct a surface-based
description - the 3D scene model - from these boundaries by
hypothesizing new vertices, edges, and faces using task-specific
knowledge. Some of the rules used here will be described next, and will
be illustrated on the wire frames in Fig. 9.
Figure 18: Simple example of a structure graph consisting of two
objects, ob1 and ob2.


Figure  19:  Obtaining  a  surface-based  description  from  wire  frames. 


Each adjacent pair of legs ordered around a wire-frame vertex is
assumed to correspond to the corner of a planar face. A partial face,
called a web face, is generated for each such pair (Fig. 19a). Next, web
faces that represent corners of a single face are merged. Web faces may
either be touching (e.g., Fig. 19b, and F1 and F2 in Fig. 9) or
non-touching (e.g., Fig. 19c, and F3 and F4 in Fig. 9). When merging two
non-touching faces, the two edges on which each matching pair of end
points lie are extended in space and intersected. The intersection points
form two new vertices on the resulting face.

Incomplete faces are then completed either as parallelograms (for faces
consisting of a single corner (Fig. 19d)) or as polygons (for faces
containing three or more connected edges (Fig. 19e)). Next, one face is
assumed to represent a hole in another face if (1) the planes of the faces
are nearly parallel and close to each other, and (2) the boundary of the
first face, when projected onto the plane of the second face, falls inside
the boundary of that face (Fig. 19f).
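
The parallelogram completion of a single-corner face mentioned above
amounts to one vector sum. The sketch below is illustrative; the function
name and the vertex ordering of the returned face are assumptions.

```python
from typing import List, Tuple

Vec = Tuple[float, float, float]


def complete_parallelogram(corner: Vec, end_a: Vec, end_b: Vec) -> List[Vec]:
    """Close a face known only by one corner and the end points of its two legs."""
    # The hypothesized fourth vertex lies opposite the known corner.
    fourth = (end_a[0] + end_b[0] - corner[0],
              end_a[1] + end_b[1] - corner[1],
              end_a[2] + end_b[2] - corner[2])
    return [corner, end_a, fourth, end_b]
```

In the scene model, the fourth vertex and the two new edges would be
tagged as unconfirmed, since they are hypothesized rather than derived
from the images.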

At this point, many objects will be only partially complete because they
are not closed. Since we are dealing with urban scenes, faces that lie high
enough above the ground are assumed to represent roofs of buildings. A
hypothesized vertical wall is dropped towards the ground from each edge
of such faces, unless the edge is already part of another face (Fig. 19g).
Each wall is dropped either to the ground or to the first face it intersects
on the way down.
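
A simplified sketch of wall dropping follows. It assumes world coordinates
with z vertical and drops every wall all the way to a flat ground at a
given height; stopping at the first intersected face, as the text
describes, is omitted. The names drop_walls and closed_edges are
hypothetical.

```python
from typing import FrozenSet, List, Tuple

Vec = Tuple[float, float, float]


def drop_walls(roof: List[Vec], ground_height: float = 0.0,
               closed_edges: FrozenSet[int] = frozenset()) -> List[List[Vec]]:
    """Hypothesize one vertical wall per roof edge not already part of another face.

    Edge i joins roof[i] and roof[(i + 1) % len(roof)]; closed_edges lists the
    indices of edges that already belong to another face and need no wall.
    """
    walls = []
    n = len(roof)
    for i in range(n):
        if i in closed_edges:
            continue
        p, q = roof[i], roof[(i + 1) % n]
        walls.append([p, q,
                      (q[0], q[1], ground_height),
                      (p[0], p[1], ground_height)])
    return walls
```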


Fig. 20 shows perspective views of the resulting scene model. Notice
that one of the buildings has a hole in it, through the roof. The planar
patches at the "front" of the scene are part of the ground. Fig. 21 shows
the scene model generated when these techniques are applied to the
wire-frame description obtained using monocular analysis (Fig. 17).

In order to render more realistic displays, gray scale is added to them
[10]. This is useful for realistically simulating the appearance of the
scene from arbitrary viewpoints. We associate with each face in the
model an intensity patch obtained from the image. For faces that are
partially occluded in the image, the intensity patch is associated with the
visible portions. Figs. 22 and 23 show the results of adding gray scale to
the faces of the models in Figs. 20 and 21, respectively.


7. Combining New View with Current Model

The process of incorporating a 3D wire-frame description extracted
from a new view into the current scene model can be divided into three
main steps:


1. The wire-frame data must first be matched to the current
model. This process provides (a) the scale transformation
and coordinate transformation from the wire-frame data to
the model, and (b) corresponding elements (i.e., vertices and
edges) in the two.


Figure 20: Perspective views of buildings reconstructed from
wire-frame data in Fig. 9.

Figure 22: Reconstructed buildings of Fig. 20 with gray scale,
derived from the left image in Fig. 4, mapped onto faces.

Figure 23: Reconstructed buildings of Fig. 21 with gray scale,
derived from Fig. 10, mapped onto faces.

2. The new wire-frame data is then merged with the current
model. This process includes (a) merging pairs of
corresponding elements, and (b) adding to the model
wire-frame elements for which no correspondences were found.
During the merging process, hypothesized parts of the model
that are inconsistent with the new wire-frame data are
deleted.

3.  At  this  point,  many  objects  in  the  model  may  be  incomplete 
because  (a)  new  wire-frame  data  has  been  added,  and/or  (b) 
some  hypothesized  elements  have  been  deleted.  These 
objects are completed using the techniques described in the
previous  section. 

To see how these steps are carried out, consider the example of
incorporating the information from a second view into the scene model
of Fig. 20. This scene model was constructed from the set of wire frames
(Fig. 9) automatically extracted from a "front" view of the scene (Fig. 4).
The second set of wire frames, shown in Fig. 24, was manually generated
to simulate information available from an opposing point of view
(viewing the scene from the "back"). Notice that the information in Fig.
9 emphasizes edges and vertices facing the front of the scene, while those
facing the back of the scene are emphasized in Fig. 24.


Figure 24: Perspective view of manually generated vertices and
edges. The viewpoint for this drawing is chosen to be similar to Fig. 9.
Points P1, P2, and P3, for example, correspond to points P1, P2, and
P3 in Fig. 9.


We assume in this example that the scale and coordinate
transformations from the new wire-frame data to the current model are
known. Next, corresponding edges and vertices in the data and model are
obtained, as described elsewhere [15, 14].

7.1. Discrepancies

We must now merge the new wire-frame data into the model. An
important  issue  here  is  how  to  handle  discrepancies  between  the  two.  We 
consider  the  following  two  types  of  discrepancies: 

1. After the coordinate system of the wire-frame data has been
transformed to that of the model and scale adjustments have
been made, corresponding pairs of confirmed vertices and
edges may not register perfectly in 3-space. In order to merge
them into single elements, we perform a "weighted
averaging" of their positions.

2. Hypothesized elements in the model may be inconsistent
with newly obtained elements. We handle this by deleting
such hypothesized elements.


To determine whether or not hypotheses are still valid when confirmed
elements in the model are modified or deleted, we consider the elements
which gave rise to the hypotheses. A hypothesis is dependent on all
elements whose existence directly resulted in the creation of the
hypothesis. If one of these elements is modified or deleted, the
hypothesis must also be modified or deleted since the conditions under
which it was created are no longer valid. The dependency relationships
for hypothesized elements are explicitly recorded at the time of their
creation using dependency pointers [11].

The  following  examples  show  how  some  of  these  relationships  are 
recorded: 

1. When two non-touching partial faces are merged (Fig. 25a),
each face has two edges which are intersected with their
counterparts in the other face. The intersection points form
two new hypothesized vertices, each of which is dependent
on the two edges whose intersection gave rise to it. In Fig.
25a, vertex v1 is dependent on edges e1 and e3, and vertex v2
is dependent on edges e2 and e4. If one of the edges were to
be modified (e.g., if its position were to be displaced), the
vertex that depends on that edge would no longer be a valid
hypothesis, and would therefore be deleted. A new vertex
might then be hypothesized.

2. When a face is completed by connecting its two end points
(Fig. 25b), two new vertices and one new edge are
hypothesized. The new edge e4 is dependent on both e1 and
e3, while the new vertices v1 and v2 are dependent on the
edges on which they lie.

When  a  confirmed  edge  or  vertex  in  the  model  is  modified  or  deleted, 
the set of all hypothesized elements that depend on it is deleted.
Recursively,  elements  depending  on  deleted  ones  are  also  deleted. 
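
The dependency bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the 3D Mosaic implementation: the class `SceneModel`, its method names, and the extra hypothesized edge `e5` in the usage note are hypothetical, and elements are reduced to plain string names.

```python
from collections import defaultdict


class SceneModel:
    """Toy element store with dependency pointers (hypothetical names)."""

    def __init__(self):
        self.elements = {}                  # name -> "confirmed" or "hypothesized"
        self.dependents = defaultdict(set)  # name -> hypotheses created from it

    def add_confirmed(self, name):
        self.elements[name] = "confirmed"

    def add_hypothesis(self, name, depends_on):
        """Record a hypothesis together with the elements that gave rise to it."""
        self.elements[name] = "hypothesized"
        for parent in depends_on:
            self.dependents[parent].add(name)

    def invalidate(self, name):
        """Delete all hypotheses that depend, directly or recursively, on `name`.

        `name` itself is left alone; the caller modifies or deletes it.
        Returns the set of deleted element names.
        """
        deleted = set()
        stack = list(self.dependents.pop(name, ()))
        while stack:
            child = stack.pop()
            if child not in self.elements:
                continue  # already deleted through another dependency path
            del self.elements[child]
            deleted.add(child)
            stack.extend(self.dependents.pop(child, ()))
        return deleted
```

With the dependencies of Fig. 25a (v1 on e1 and e3, v2 on e2 and e4) plus a hypothetical edge e5 that depends on v1, calling `invalidate("e1")` removes v1 and e5 and leaves v2 untouched.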

7.2.  Merging 

The procedure that merges corresponding wire-frame and model
objects takes into account the fact that the 3-space positions of end points
of edges that are confirmed vertices are generally much more accurate
than the positions of non-vertex end points. Therefore, confirmed
vertices are given more weight during merging. As an example, consider
Fig. 26. Suppose the wire-frame object in (b) is to be merged with the
model object in (a), and the corresponding edges and vertices are as
follows: (v2, v100), (v3, v101), (e2, e100), (e3, e101), (e4, e102), (e12,
e104). We assume the wire-frame object has been transformed to register
with the model object.


Figure 25: Generating dependencies for hypothesized edges and
vertices. The dependence of an element on another is depicted as an
arrow from the former to the latter. (a) Two non-touching partial
faces are merged. (b) A face is completed.

Figure 26: The wire-frame object in (b) is to be merged with the
model object in (a). The confirmed edges of the model object
(indicated by solid lines) are e1, e2, e3, e4, and e12; the confirmed
vertices (indicated by circles) are v1, v2, and v3. Dashed lines
represent hypothesized edges. (c) The result after merging.


The merging procedure starts by merging corresponding vertices. Pairs
of vertices ((v2, v100) and (v3, v101) in Fig. 26) are combined into single
vertices with coordinates of the midpoint between them. At this point,
all corresponding pairs of edges will share at least one vertex. The
corresponding edges are merged next as follows:

1. If the two edges share both their vertices ((e3, e101) in Fig.
26), the new edge connects the two new vertices already
generated.

2. If one edge has two confirmed vertices but the other does not
((e2, e100) and (e4, e102) in Fig. 26), the new edge is the same
as the former. Notice that the non-vertex end point in this
case is given zero weight.

3. If the two edges share one vertex and the other end points are
not confirmed ((e12, e104) in Fig. 26), the new edge is the
"average" of the two edges.

If a model edge to be merged contains only one confirmed vertex (e.g.,
e4 and e12 in Fig. 26), then all hypothesized elements that recursively
depend on this edge are deleted. For example, the hypothesized elements
that recursively depend on e4 in Fig. 26 are the vertices v4 and v7, and
the edges e5, e10, e9, and e11.


Figure 27: Perspective views of buildings derived by incorporating
the wire-frame data in Fig. 24 into the model in Fig. 20.


After all corresponding elements of the two objects have been merged,
the edges and vertices remaining in the wire-frame object that were not
merged (e103 in Fig. 26) are added to the model object. The final
configuration after merging is shown in Fig. 26c. This object is
incomplete and must be completed using the techniques described in an
earlier section.
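
The vertex and edge merging rules of this section can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the text: an edge is represented as a pair of end points, each end point is a `(position, is_confirmed_vertex)` tuple, the end points of corresponding edges are listed in corresponding order, and the "average" of two edges is taken to be the midpoint of corresponding end points. The function names are hypothetical.

```python
def midpoint(p, q):
    """Midpoint of two 3-space positions, used to merge corresponding vertices."""
    return tuple((a + b) / 2.0 for a, b in zip(p, q))


def merge_edges(model_edge, wire_edge):
    """Merge two corresponding edges after their shared vertices were merged.

    Each edge is ((pos, confirmed), (pos, confirmed)), end points in
    corresponding order (an assumption of this sketch).
    """
    model_full = all(confirmed for _, confirmed in model_edge)
    wire_full = all(confirmed for _, confirmed in wire_edge)

    # Rule 2: an edge with two confirmed vertices wins outright, so the
    # non-vertex end point of the other edge gets zero weight.
    if model_full and not wire_full:
        return model_edge
    if wire_full and not model_full:
        return wire_edge

    # Rules 1 and 3: shared vertices already coincide after vertex merging,
    # so averaging leaves them in place and only moves unconfirmed end points.
    return tuple(
        (midpoint(pa, pb), ca and cb)
        for (pa, ca), (pb, cb) in zip(model_edge, wire_edge)
    )
```

Under rule 3, for example, two edges that share the vertex (1, 0, 0) and have unconfirmed end points at (1, 4, 0) and (1, 6, 2) merge into an edge from (1, 0, 0) to (1, 5, 1).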


7.3.  Results  of  Merging 

When these procedures are applied to the wire-frame data in Fig.
24 and the scene model in Fig. 20, we obtain the updated scene model
shown in Fig. 27. The updated version has two important improvements
over the initial version. First, the updated model contains more
buildings since new wire-frame data, some of which represent new
buildings, have been incorporated into the initial model. Second, for
many buildings described in both versions of the model, the positions of
vertices and edges are more accurate in the updated version. This is
because many hypothesized vertices and edges are replaced by accurate
ones obtained from the new data, and many confirmed vertices and
edges are merged with corresponding ones in the data by "averaging"
their positions, generally decreasing the amount of error.

The  shape  of  the  large  hole  in  the  roof  of  one  of  the  buildings  has 
changed  from  a  rectangle  in  the  initial  model  to  an  almost  triangular 
quadrilateral  in  the  updated  version.  When  compared  with  the  source 
images  in  Fig.  4,  the  rectangular  shape  would  seem  more  accurate. 
However,  the  positions  of  the  edges  and  vertices  that  form  the  hole  are 
more  accurate  in  the  updated  model  in  the  sense  that  they  are  more 
faithful  to  the  wire  frame  descriptions  derived  from  the  images. 


This  experiment  demonstrates  how  information  provided  by  each 
additional  view  allows  the  model  to  be  incrementally  made  more 
complete and accurate.

8. Conclusions

We set out to develop an entire vision system to interpret complex
images, one that goes all the way from images to symbolic 3D
descriptions. The following are some conclusions we can draw from this
project.

1. Complex images usually cannot be fully interpreted.
Difficulties in interpretation arise not only from occlusions,
but also from variations in surface texture and reflectance,
variations in shape, and complex lighting conditions. Our
vision systems must therefore have the capability to deal with
approximate, imperfect scene descriptions when performing
tasks such as matching, path planning, or model-based image
interpretation.

2.  Incremental  reconstruction  of  complex  scenes  will  often  be 
necessary. Multiple views are required to effectively
reconstruct  complex  scenes.  A  system  that  moves  about  and 
interacts  with  its  environment  in  order  to  obtain  the  multiple 
views  will  be  able  to  gradually  add  more  information  to  its 
scene  model  at  the  same  time  that  it  carries  out  its  other  tasks. 

3.  Scene  descriptions  are  often  more  useful  if  they  contain 
reasonable  hypotheses  for  parts  of  the  scene  for  which  there 
are  only  partial  or  no  data.  For  example,  path  planning 
cannot  be  done  for  occluded  regions  of  the  scene  without  a 
good guess about what lies in these regions. If the hypotheses
turn out to be incorrect, they should eventually be modified.
Our vision systems must therefore have mechanisms for
intelligently generating hypotheses, verifying them, and
modifying them.

4.  Task-specific  knowledge  is  very  useful  at  all  levels  of 
complex  image  interpretation,  from  low-level  image  analysis 
to  high-level  formation  of  symbolic  descriptions.  Knowledge 
of  block-shaped  objects  in  an  urban  scene  is  used  in  the  3D 
Mosaic  system  for  stereo  analysis,  monocular  analysis,  and 
reconstructing  shapes  from  the  wire  frames. 

5.  Stereo  matching  of  2D  structural  features  (such  as  junctions) 
may  be  important  for  complex  images  and  should  be  further 
investigated.

Abstract

This paper describes a new technique for use in the automatic
production of digital terrain models from stereo pairs of
aerial images. This technique employs a coarse-to-fine hierarchical
control structure both for global constraint propagation and
for efficiency. By the use of disparity estimates from coarser levels
of the hierarchy, one of the images is geometrically warped to
improve the performance of the cross-correlation-based matching
operator. A newly developed surface interpolation algorithm
is used to fill holes wherever the matching operator fails.
Experimental results for the Phoenix Mountain Park data set are
presented and compared with those obtained by ETL.

1  Introduction 

The primary objective of this research was to explore new
approaches to automated stereo compilation for producing digital
terrain models from stereo pairs of aerial images. This paper
presents an overview of the hierarchical warp stereo (HWS)
approach, and shows experimental results when it is applied to the
ETL Phoenix Mountain Park data set.

The stereo images are assumed to be typical aerial-mapping
pairs, such as those used by USGS and DMA. Such pairs of images
are different perspective views of a 3-D surface acquired at
approximately the same time and illumination angles. Normally
these views are taken with the camera looking straight downward.
The major effect of nonverticality is to increase the incidence of
occlusion, which increases the difficulty of point correspondence.

We  shall  call  one  of  these  images  the  “reference  image,”  and 
the  other  the  “target  image.”  We  will  be  searching  in  the  target 
image  for  the  point  that  best  matches  a  specified  point  in  the 
reference  image. 

It  is  also  assumed  that  the  epipolar  model  for  the  stereo  pair 
is  known,  which  means  that  for  any  given  point  in  one  image 
we  can  determine  a  line  segment  in  the  other  image  that  must 
contain  the  point,  unless  it  is  occluded  from  view  by  other  points 
on the 3-D surface. This is certainly a reasonable assumption,
since  an  approximation  to  the  epipolar  model  can  be  derived 
from  a  relatively  small  number  of  point  correspondences  if  the 
parameters  of  the  imaging  platform  are  not  known  a  priori. 

The primary goal is to automatically determine correspondences
between points in the two images, subject to the following
criteria:

• Minimize the rms difference between the disparity measurements
and “ground truth.” Without ground truth, we
cannot measure this.

•  Maximize  the  sensitivity  of  the  disparity  measurements  to 
small-scale  terrain  features,  while  minimizing  the  effects  of 
noise. 

•  Minimize  the  frequency  of  false  matches. 

•  Minimize  the  frequency  of  match  failures. 

These  criteria  are  mutually  exclusive.  Under  ideal  conditions, 
increasing  the  size  of  the  match  operator  decreases  the  effects 
of  noise  on  the  disparity  measurement,  but  it  also  diminishes 
sensitivity  to  small  terrain  features.  Similarly,  tightening  the 
match  acceptance  criteria  reduces  the  frequency  of  false  matches, 
but  results  in  more  frequent  match  failures. 

One  of  the  goals  of  this  system  is  to  minimize  the  number 
of  parameters  that  must  be  adjusted  individually  for  each  stereo 
pair  to  get  optimum  performance. 

2  Approach 

This section briefly explains the HWS approach, which consists
of three major components:

• Coarse-to-fine hierarchical control structure for global constraint
propagation as well as for efficiency.

• Disparity surface interpolation to fill holes wherever the
matching operator fails.

• Geometric warping of the target image by using disparity
estimates from coarser levels of the hierarchy to improve the
performance of the cross-correlation-based matching operator.

2.1  The Use of Hierarchy and Surface Interpolation
to Propagate Global Constraints

The goal of stereo correspondence is to find the point in the
target image that corresponds to the same 3-D surface point as
a given point in the reference image. It is often impossible to
select the correct match point with only the image information
that is local to the given point in the reference image in combination
with the image information along the epipolar line segment
in the target image. When the 3-D surface contains a replicated
pattern, there is the likelihood of match point ambiguity. Let us
consider, for example, a stereo pair that contains a parking lot
with repetitive markings delimiting the parking spaces. Around
the edges of the lot there are image points that can be matched
unambiguously. Within the parking lot, ambiguity is likely,
depending on the orientation of the repetitive patterns with the
epipolar line. A successful stereo correspondence system must be
able to use global match information to resolve local match-point
ambiguity.

HWS approaches this problem in two ways. First, global
constraints on matches are propagated by the coarse-to-fine
progression of the matching process. Disparities computed at lower
resolution are employed to constrain the search in the target image
to a small region of the epipolar line, which also greatly
reduces the probability of selecting the wrong point when ambiguity
is present. Second, whenever the match process fails to
find a suitable match or detects a possible match ambiguity, a
disparity estimate is inserted that is based on a surface interpolation
algorithm, which uses information from a neighborhood
around the disparity “hole,” with the size of the neighborhood
depending on the number of neighboring “holes.”

2.2  The Use of Image Warping to Improve Correlation
Operator Performance

One  of  the  greatest  problems  in  the  use  of  area  correlation  for 
match  point  determination  is  the  distortion  that  occurs  because 
of disparity changes within the correlation window. Since area-based
correlation matches areas, rather than individual points,
the  disparity  it  calculates  is  influenced  by  the  disparities  of  all  of 
the  points  in  the  window,  not  just  the  point  at  the  center.  When 
there  are  high  disparity  gradients  or  disparity  discontinuities,  the 
correlation  calculated  for  the  correct  disparity  can  actually  be  so 
poor  that  some  other  disparity  will  have  a  higher  correlation 
score. 

The effect of correlation window distortion can be greatly
mitigated in a hierarchical system by using the disparity estimates
from the previous level of matching to warp the target
image geometrically at its current resolution level into closer
correspondence with the reference image.
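
As a rough illustration of such warping, the following Python sketch resamples a single target scanline using per-pixel disparity estimates. It assumes horizontal epipolar lines, the sign convention that reference pixel x corresponds to target position x + disparity[x], and linear interpolation between pixels; none of these details is specified in the text, and the function name is hypothetical.

```python
def warp_scanline(target, disparity):
    """Resample one target scanline so that warped[x] ~ target[x + disparity[x]].

    Positions that fall outside the scanline are clamped to its ends.
    """
    n = len(target)
    warped = []
    for x, d in enumerate(disparity):
        pos = min(max(x + d, 0.0), n - 1.0)  # clamp to the valid pixel range
        left = int(pos)                      # floor, valid because pos >= 0
        right = min(left + 1, n - 1)
        frac = pos - left
        warped.append((1.0 - frac) * target[left] + frac * target[right])
    return warped
```

After warping with a good estimate, the residual disparity between the warped target and the reference is small and varies slowly within the correlation window, which is the condition under which area correlation behaves well.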


2.3  Related  Work 

Norvelle [1] implemented a semiautomatic stereo compilation
system at the U.S. Army Engineer Topographic Laboratories
(ETL) that operates in a single pass through the images. It uses
disparity surface extrapolation both to predict the region of the
epipolar segment for matching and to estimate the local surface
orientation so as to warp the correlation window. He found that
these techniques improved the performance of the system
significantly, but that considerable manual intervention was needed
when the surface extrapolator made bad predictions, or when the
image contained areas with no information for matching, with
ambiguities, or with occlusions.

3  Sequence  Of  Operations  In  Hierarchical 
Warp  Stereo 

Figure  1  illustrates  the  hierarchical  control  structure  of  the 
system.

FIGURE 1

Block  Diagram  of  Hierarchical  Control  Structure 


1.  Initialize: 

•  Start  with  a  stereo  pair  of  images  (assumed  to  be  of 
the  same  dimensions). 

•  Call  one  of  these  images  the  “reference  image,”  the 
other  the  “target  image.” 

• Construct Gaussian pyramids (Burt [2]) $reference_i$
and $target_i$ for each image. The images at level $i$ in
these pyramids correspond to reductions of the original
images by a factor of $2^i$.

• Set $disp_{-1}$ to either the a priori disparity estimates or
all zeros.

• Start the iteration at level $i = 0$.

• Choose the pyramid depth $D$ so that:

$$D = \lceil \log_2(\mathit{uncertainty}) \rceil - 1,$$

where uncertainty is an estimate of the maximum
difference between $disp_{-1}$ and the “true” disparities.
This guarantees the “true” disparities will be within
the range (-2 : +2) at level 0 of the matching.

2. Warp: Use the disparity estimates $2 * disp_{i-1}$ to warp
$target_{D-i}$ geometrically into approximate alignment with
$reference_{D-i}$. Note that the factor of two is equal to the
ratio of image scales between level $i$ and level $i - 1$ of the
hierarchy.

3. Match: Using the matching operator, compute the residual
disparities $\Delta disp_i$ between the warped target and the
reference images at level $i$.




4. Refine: Compute the refined disparity estimates:

$$disp_i = 2 * disp_{i-1} + \Delta disp_i$$

5. Fill: Use the surface interpolation algorithm to fill in disparity
estimates at positions where the matching operator
fails because of no image contrast, ambiguity, etc.

6.  Increase  resolution:  If  i  =  D,  quit;  otherwise  let  i  =  i  +  1 
and  go  to  Step  2. 
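
The depth formula and the refine recurrence of Steps 1 to 6 can be checked with a small Python sketch that follows a single pixel through the hierarchy. It is a toy under stated assumptions, not the HWS implementation: the match operator is replaced by rounding the known true disparity at each scale, $disp_{-1}$ is taken to be zero, the depth is clamped at zero for very small uncertainties, and the function names are hypothetical.

```python
import math


def pyramid_depth(uncertainty):
    """D = ceiling(log2(uncertainty)) - 1, clamped at 0 (the clamp is an assumption)."""
    return max(0, math.ceil(math.log2(uncertainty)) - 1)


def coarse_to_fine(true_disparity, uncertainty):
    """Follow one pixel's disparity through levels 0..D, starting from disp_{-1} = 0.

    Returns the final full-resolution disparity and the residual found at each level.
    """
    depth = pyramid_depth(uncertainty)
    disp = 0  # disp_{-1}: no a priori estimate
    residuals = []
    for i in range(depth + 1):
        scale = 2 ** (depth - i)                  # image reduction factor at level i
        predicted = 2 * disp                      # Warp: estimate from the coarser level
        measured = round(true_disparity / scale)  # stand-in for the match operator
        residual = measured - predicted           # Match: residual disparity
        disp = predicted + residual               # Refine
        residuals.append(residual)
    return disp, residuals
```

For an uncertainty of 40 pixels the depth is D = 5, so level 0 works on images reduced by a factor of 32, where a 40-pixel disparity shrinks to at most 1.25 pixels. In this toy the residual at every level stays within the (-2 : +2) search range while the estimate converges to the full-resolution value.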

4  Disparity  Estimation 

Disparity  estimation  consists  of  three  parts: 

•  Computing  match  operator  scores  for  disparities  along  an 
epipolar  segment. 

•  Accepting  or  rejecting  the  collection  of  scores  according  to 
a  model  for  the  shape  of  the  correlation  peak. 

•  Estimating  the  subpixel  disparities  at  acceptable  peaks. 

4.1  Match  Score  Operator 

The HWS approach presented here can be implemented with
a variety of match operators. All results reported here were
obtained with an operator that closely approximates
Gaussian-weighted normalized cross correlation. The values of the
Gaussian weights decrease with Euclidean distance from the center of
a square correlation window. In the examples shown here, the
window dimension is 13 x 13 pixels with a standard deviation of
approximately 2 pixels in the Gaussian weights. Preliminary results
indicate that the Gaussian-weighted correlation operator is
better than uniformly weighted correlation operators at locating
changes in disparity while maintaining a given level of disparity
precision.
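
For reference, a direct (unoptimized) form of Gaussian-weighted normalized cross correlation is sketched below in Python. The operator used for the reported results only approximates this quantity, and its exact form is not given here, so the sketch should be read as the textbook definition rather than the HWS operator. The function names and the small-variance threshold `eps` are assumptions.

```python
import math


def gaussian_weights(size=13, sigma=2.0):
    """Normalized Gaussian weights that fall off with distance from the window center."""
    half = size // 2
    w = [[math.exp(-((r - half) ** 2 + (c - half) ** 2) / (2.0 * sigma ** 2))
          for c in range(size)] for r in range(size)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def weighted_ncc(a, b, w, eps=1e-12):
    """Weighted normalized cross correlation of two windows of the same size as w."""
    cells = [(a[r][c], b[r][c], w[r][c])
             for r in range(len(w)) for c in range(len(w[0]))]
    mean_a = sum(wt * x for x, _, wt in cells)
    mean_b = sum(wt * y for _, y, wt in cells)
    cov = sum(wt * (x - mean_a) * (y - mean_b) for x, y, wt in cells)
    var_a = sum(wt * (x - mean_a) ** 2 for x, _, wt in cells)
    var_b = sum(wt * (y - mean_b) ** 2 for _, y, wt in cells)
    if var_a <= eps or var_b <= eps:
        return 0.0  # no image contrast: correlation is undefined, report no match
    return cov / math.sqrt(var_a * var_b)
```

Because the means and variances are themselves weighted, the score is invariant to gain and offset changes between the two windows, and pixels near the window center dominate the result.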

4.2  Evaluation  of  Correlation  Surface  Shape 

The  match  operator  reports  a  failure  if  any  of  the  following 
conditions  exist: 

•  Disparity out of range: The maximum match score is found
at either extreme of the epipolar segment.

•  Multiple peaks: The best and next best match scores are
found at disparities that differ by more than one pixel.

There are other models for the expected shape of the correlation
surface that can be based on the autocorrelation surface
shape of the windows in the reference and target images. Further
investigation is needed to evaluate the utility of such models for
both surface shape evaluation and disparity estimation.
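
The two failure conditions listed above can be sketched as a simple test on the scores computed along one epipolar segment. The one-dimensional score list, the function name, and the use of `None` to signal a match failure are assumptions made for this illustration.

```python
def accept_peak(scores):
    """Return the index of an acceptable correlation peak, or None on match failure.

    scores[k] is the match score at the k-th integer disparity along the
    epipolar segment.
    """
    order = range(len(scores))
    best = max(order, key=scores.__getitem__)

    # Disparity out of range: the maximum lies at either extreme of the segment.
    if best == 0 or best == len(scores) - 1:
        return None

    # Multiple peaks: best and next-best scores are more than one pixel apart.
    second = max((k for k in order if k != best), key=scores.__getitem__)
    if abs(second - best) > 1:
        return None

    return best
```

Positions where this test returns `None` are the holes that the surface interpolation algorithm later fills.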

4.3  Subpixel  Disparity  Estimation 

The subpixel location of the correlation surface peak is estimated
by parabolic interpolation of both the x and y directions
of disparity. For each direction, three adjacent match scores
($s_{i-1}$, $s_i$, and $s_{i+1}$, where $s_i$ is the maximum score) are used to
compute the peak as follows:

$$\frac{.5 * (s_{i+1} - s_{i-1})}{2 * s_i - s_{i+1} - s_{i-1}}$$

More  complicated  approaches  to  peak  estimation,  such  as 
two-dimensional  least-squares  fitting  of  the  correlation  surface, 
might  yield  better  estimates,  but  at  a  higher  computational  cost. 
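
The parabolic estimate is easy to verify numerically. The Python sketch below evaluates the expression for one direction and interprets the result as the offset of the peak from the integer position i, which is what fitting a parabola through the three scores yields. The function name and the fallback for a degenerate denominator are assumptions.

```python
def subpixel_offset(s_prev, s_peak, s_next):
    """Offset of the parabola vertex from the integer peak position.

    s_prev, s_peak, s_next are the scores at disparities i-1, i, i+1,
    with s_peak the maximum.
    """
    denom = 2.0 * s_peak - s_next - s_prev
    if denom <= 0.0:
        return 0.0  # flat or not a strict maximum: keep the integer peak
    return 0.5 * (s_next - s_prev) / denom
```

For scores sampled from the parabola s(x) = 1 - (x - 0.3)^2 at x = -1, 0, 1, the function recovers the offset 0.3, and for a symmetric triple it returns zero.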

5  Surface  Interpolation  Algorithm 

The goal of the surface interpolation algorithm is to estimate
values for the disparity surface at points where the match operator
reported failure; such points will be called “holes.” The
approach to filling a hole at location $x, y$ is to model the surface
by employing the disparity measurements over the set of nonholes
$\bar{H}$ in the $n \times n$ pixel neighborhood centered at $x, y$. The
set $H$ contains the indices of all holes in the neighborhood.

This surface interpolation algorithm is based on the solution
to the hyperbolic multiquadric equations described in Smith [3].
The surface is known at the set of points $x_i, y_i, z_i$ where $i \in \bar{H}$,
and can be estimated at other points $h \in H$ by the formula

$$z(x_h, y_h) = \sum_{i \in \bar{H}} c_i * g(x_h - x_i,\; y_h - y_i),$$

where $g$ is the basis function for the surface representation, and
coefficients $c_i$ are the solutions to the set of linear equations:

$$z(x_j, y_j) = \sum_{i \in \bar{H}} c_i * g(x_i - x_j,\; y_i - y_j) \quad \text{for all } j \in \bar{H}$$

Clearly, this irregular grid solution could be used to compute
the surface values at the holes in the disparity data, but this
involves solving for the coefficients $c_i$ for each different configuration
of holes and nonholes in the $n \times n$ neighborhoods of the
disparity surface.

An alternative approach, which is used here, is to convert the
quasi-regular grid problem into a regular grid problem in which
each $c_i$ at a hole is forced to be zero, and the corresponding $z_i$
remains as an unknown. This results in the same solution that
would have been obtained from the irregular grid formulation
and produces the following system of linear equations:

$$\sum_{i \in H} A^{-1}_{ki} * z_i = -\sum_{j \in \bar{H}} A^{-1}_{kj} * z_j \quad \text{for all } k \in H, \qquad (1)$$

where $A^{-1}$ is the inverse of the matrix $A_{ij} = g(x_i - x_j,\; y_i - y_j)$
for $i, j \in H \cup \bar{H}$. This system of equations must be solved for each
$z_i$ for $i \in H$. Thus, we have reduced the size of the linear system
of equations that must be solved from the number of elements
in $\bar{H}$ to the number of elements in $H$. Of course, the matrix $A$
must be computed and inverted once.
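
To make the interpolation concrete, the Python sketch below fills a single hole using the irregular grid formulation, which the text states gives the same solution as system (1). It is a hedged illustration only: the basis function is assumed to be the hyperbolic multiquadric $g(\Delta x, \Delta y) = \sqrt{\Delta x^2 + \Delta y^2 + s^2}$ with an arbitrary shape parameter $s = 1$, since the exact form used from Smith [3] is not given here, and the helper names are hypothetical.

```python
import math


def basis(dx, dy, softness=1.0):
    """Assumed hyperbolic multiquadric basis; `softness` is a made-up shape parameter."""
    return math.sqrt(dx * dx + dy * dy + softness * softness)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fill_hole(known, hole):
    """Estimate z at `hole` = (x, y) from nonhole samples `known` = [(x, y, z), ...]."""
    # Coefficients c_i solve z_j = sum_i c_i * g(x_i - x_j, y_i - y_j) over the nonholes.
    a = [[basis(xi - xj, yi - yj) for (xi, yi, _) in known] for (xj, yj, _) in known]
    coeffs = solve(a, [z for _, _, z in known])
    hx, hy = hole
    return sum(c * basis(hx - xi, hy - yi) for c, (xi, yi, _) in zip(coeffs, known))
```

This version re-solves for the coefficients for every hole configuration, which is exactly the cost that the regular grid formulation (1) avoids by inverting $A$ once.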

Areas on the disparity surface that contain large clusters of
holes cause problems. The previous surface interpolation algorithm
degenerates to a surface extrapolation algorithm when the
nonholes in the neighborhood are not more or less isotropically
distributed over the entire neighborhood. The problem can be
overcome by increasing the size of the neighborhood until some
spatial-distribution criterion is met, but this would require solving
extremely large linear systems.

Large  holes  are  filled  by  means  of  the  following  hierarchical 
approach: 

Procedure Surface-Interpolate($surface_i$)

1. If $surface_i$ contains large holes then

(a) Compute $\mathit{filled\text{-}surface}_{i+1}$ =
expand(surface-interpolate(reduce($surface_i$))),
where reduce computes a Gaussian convolution reduction
by a factor of two, surface-interpolate is a recursive
call to this interpolation algorithm, and expand
computes expansion by a factor of two, using bilinear
interpolation.

(b) For each hole in $surface_i$ that is completely surrounded
by other holes, fill the hole with the value from the
$\mathit{filled\text{-}surface}_{i+1}$.

2. For each hole in $surface_i$, fill the hole by solving the system
of linear equations (1) for the $n \times n$ pixel neighborhood
centered at the hole ($n = 7$ in the examples).

3. Return the filled $surface_i$.

6  Examples 

This section describes the experimental results achieved when
the HWS technique was applied to areas of the ETL Phoenix
Mountain Park data set, and compares these results to those obtained
from the semiautomatic system developed by Norvelle [1].

The  following  components  of  the  Phoenix  Mountain  Park 
data  set  were  used: 

•  Left image: 2048 x 2048 pixels, 8 bits per pixel

•  Right image: 2048 x 2048 pixels, 8 bits per pixel

•  x-correspondence array: 400 x 400 points, floating point.

The left and right images had been scanned such that the
epipolar lines were almost exactly horizontal. The ETL
x-correspondence array was converted to an x-disparity image to
enable comparison between ETL and HWS results.

Results  are  shown  for  two  different  areas  of  the  Phoenix  data 
set.  All  disparity  measurements  are  indicated  in  terms  of  pixel 
distances  in  the  2048  x  2048  Phoenix  stereo  pair,  rather  than  the 
resolution  of  the  selected  windows. 

•  Area A is defined by two approximately aligned 150 x 150-pixel
windows of the Phoenix pairs which were reduced
by a factor of four (the windows thus corresponding to
600 x 600-pixel windows of the originals). The measured
disparities for area A range from -40 to +16 pixels.

•  Area B is defined by two approximately aligned 125 x 125-pixel
windows of the Phoenix pairs which were reduced
by a factor of two (the windows thus corresponding to
250 x 250-pixel windows of the originals). The measured
disparities for area B range from -40 to -34 pixels.

Figures  2  and  3  show  the  inputs  and  outputs  of  three  levels 
of  the  hierarchy  for  areas  A  and  B,  respectively.  Columns  1  and 
2  are  the  reference  and  target  images  at  each  level.  Column  3 
is  a  binary  image  that  indicates  the  positions  of  match  failures. 
Column  4  shows  the  resulting  disparity  image  of  each  level  after 
the  match  failures  have  been  replaced  by  surface-interpolated 
disparity  values. 

Figures  4  and  5  contain  a  comparison  of  the  HVVS  results  with 
those  obtained  at  ETL  by  Norvelle  for  areas  A  and  B  respectively. 


The bottom-left images of figures 4 and 5 show the pixel-by-pixel
differences, after contrast enhancement, between the HVVS
and ETL disparities. The graphs to the right of these difference
images depict the histograms of these differences.

The mean and standard-deviation values shown with the histograms
provide a useful quantitative comparison between the
HVVS and ETL results. They show that the average disparity
differences were .082 and .025 pixels, and that the standard
deviations of the disparity differences were .67 and .34 pixels for the
A and B window pairs, respectively, in terms of pixel distances
in the 2048 x 2048 Phoenix pairs. These standard deviations become
.17 and .17 pixels when expressed relative to the scales of the
A and B windows (reduced by factors of four and two), respectively.
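
As an illustration of how such figures can be obtained, the following minimal sketch computes the mean and standard deviation of per-pixel disparity differences and rescales the standard deviation to window pixels. The function name `difference_stats` and the sample values are hypothetical; only the reported standard deviations (.67 and .34) and the reduction factors (four and two) come from the text.

```python
import statistics

def difference_stats(hvvs, etl, reduction):
    """Mean and standard deviation of per-pixel disparity differences.

    hvvs, etl : equal-length sequences of disparities, in full-resolution pixels
    reduction : factor by which the window was reduced before matching
    Returns (mean, std in full-resolution pixels, std in window pixels).
    """
    diffs = [h - e for h, e in zip(hvvs, etl)]
    mean = statistics.mean(diffs)
    std = statistics.pstdev(diffs)          # population standard deviation
    return mean, std, std / reduction

# Rescaling the standard deviations reported in the text to window pixels.
for name, std_full, reduction in [("A", 0.67, 4), ("B", 0.34, 2)]:
    print(name, round(std_full / reduction, 2))
```

Dividing by the reduction factor is all that is needed because a disparity of d pixels at full resolution corresponds to d / reduction pixels in the reduced window.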

Similar results have been achieved for other examples that
include both higher resolution and larger windows.

7  Problems 

HVVS  is  still  very  experimental.  Some  of  the  parameters  that 
affect  the  system,  such  as  the  range  of  disparities  to  compute  at 
each  level  of  hierarchy  and  the  size  of  the  correlation  operator, 
are  still  specified  manually. 

There  are  problems  in  estimating  the  range  of  disparities  to 
be  computed  at  each  level  of  the  hierarchy.  If  the  estimate  is 
too  low,  there  will  be  frequent  out-of-range  match  failures.  If, 
on  the  other  hand,  the  estimate  is  too  high,  computation  time 
will  increase  and  there  will  be  more  potential  for  match  point 
ambiguity. 

HVVS has difficulty dealing with steep terrain features that
have small image projections, but large disparities. At low
resolutions in the matching hierarchy, the disparities of the terrain
surrounding the feature dominate those of the feature itself,
resulting in a disparity estimate that is usually intermediate
between that of the feature and that of the surround. At higher
resolutions in matching, the disparity of the steep feature may
be outside the permissible disparity range.

HVVS has even greater problems with oblique stereo pairs
containing many occlusions. At low matching resolution, the
disparities of foreground and background in the same neighborhoods
cannot be distinguished. As the matching resolution increases,
foreground and background features are discernible as separate
objects, but their disparities are out of range for the matcher.

Most of the difficulties caused by sudden changes in disparity
might be solved by preceding the disparity surface interpolation
step with an algorithm that attempts to match still unmatched
regions in the reference image with regions in the target image
that likewise have not yet been matched. We thus attempt to
match holes with holes.

8  Conclusions 

HVVS produces very good results for vertical stereo pairs of
rolling terrain. With the inclusion of a hole-to-hole matching step,
HVVS should be capable of comparable performance for terrain
characterized by steep slopes and frequent occlusions.

FIGURE 2 HVVS results for area A

FIGURE 3 HVVS results for area B

FIGURE 4 HVVS vs. ETL results for area A

FIGURE 5 HVVS vs. ETL results for area B

Abstract

First, an edge is defined in terms of edgels, i.e. short,
linear segments, each characterized by a direction and a
position. These, in turn, correspond to discontinuities in
the intensity profile of the imaged scene. The problems
associated with zero-crossings of the second derivative are
then discussed. An alternate approach, which is a variant
of surface-fitting, is presented. This fits a series of
one-dimensional surfaces to each window, and accepts the
surface with the smallest adequate basis. The tanh is an
adequate basis for a step-edge, and its combinations are
adequate for the roof-edge and the line-edge. This method
is robust w.r.t. noise; it has subpixel position resolution
and a 5° angle resolution; it is insensitive to high intensity
gradients; and further, it takes into account convolution
with a gaussian, which to a first approximation describes
the imaging process. These claims are substantiated with
processed images.

I.  Introduction 

It is hard to overemphasize the importance of edge-detection
in image understanding and yet, the common
view in the community seems to be that the problem is far
from solved.

We begin, in section II, by giving a definition of an
edge in terms of the intensity profile which would be
recorded by a perfect imaging system. Then, in section III,
some of the problems associated with detection based on
zero-crossings of the second derivative are discussed. Much
of the work to date has used a variant of this criterion. In
this paper a variant of the surface-fitting approach is used;
however, there are at least five significant differences from
most previous approaches. 1) A one-dimensional surface,
i.e. a surface constrained to be constant in one dimension,
is used. It results in effectively filtering out the noise without
affecting the edges. 2) It is not sought to mark pixels
as belonging to an edge, but to detect short, linear pieces
of edges, called edgels. Thus, to display our edges, which
have sub-pixel localization, we use a finer grid than that
of the image. 3) The blurring function of the imaging system,
which is a gaussian to a first-order approximation, is
taken into account. This is considered more appropriate
than edge-detection preceded by deconvolution, keeping in
view the ill-conditioned nature of the latter. 4) An adequate
basis has been found not only for most step-edges,
but also for roof-edges and line-edges. These are various
configurations of the tanh function. 5) Although this is not
the first work to consider the problem of multiple detection,
the philosophy is very different. We believe that if
an edge-detector has sub-pixel localization, the multiple
responses of each edge would lie within sub-pixel dimensions
and any of them is a good enough estimate of the
edge. This is, by all means, preferable to a single estimate
which has worse localization. In fact, it may be possible
to combine this wealth of information to get a refined
estimate. These and other details are discussed in section
IV. The precise algorithm, which has been implemented
for step-edges, is listed step-wise in section V.

This approach to edge-detection is robust w.r.t. noise.
For (step-size / σ_noise) > 2 it has subpixel position resolution
and a 5° angle resolution. Further, it is insensitive to
high intensity gradients which do not correspond to edges.
These claims are substantiated with examples in section
VI and we conclude with section VII.

It should be pointed out that the problems of linking
and scale are not considered. It appears that these are not
as involved as they may seem, if the edgel-detection
produces few false positives and false negatives.

II.  Definition  of  an  Edge 

Any edge in an image can be thought of as short linear
segments, called edgels, each characterized by a position
and an angle. These edgels would correspond to local
discontinuities of varying order in an image viewed with a
perfect imaging system. To make ourselves clear, a discontinuity
of the n-th order is one whose n-th derivative contains
a delta function. Hence, the line-edge would contain
a discontinuity of the 0th order, the step-edge of the 1st order
and the roof-edge of the 2nd order. Some examples of
these are shown in Fig. 1. However, the images we obtain
in practice are far from perfect. Not only are they noisy,
but also degraded due to optical and other aberrations.
These effects can be represented, to a first-order approximation,
by convolution with a gaussian [Andr], having a
certain standard deviation. This is fortunate because it
bandlimits the signal before sampling it. Absence of this
blur would result in severe aliasing. However, this affects
the edges too. There are no discontinuities in the image
anymore. This is of consequence, as will be noted in the
section on the zero-crossings of the second derivative.

Fig. 1. Examples of Edges.


III.  Zero-Crossings  of  the  Second  Derivative 

Much of the work to date has used zero-crossings
of the second derivative to detect and/or localize edges
[Cann, Hara etc.]. The problems associated with this are
many. Any derivative is inherently a noise sensitive operator.
If the window fitting technique is used, and the basis
is inadequate, as will often be the case, the zero-crossing
will result in extremely bad localization. For example, the
typical error for a cubic basis used to detect a step edge is
a few pixels.

Further, it is not hard to see that we can have zero-crossings
in the absence of an edge, e.g. at the base of a
ramp [Binf]. Zero-crossings of the second derivative are
basically points of inflexion and these need not correspond
to edges, as in the case of a corrugated intensity surface.

It will be a rare occurrence for a step-edge to correspond
to an underlying mathematical step. We would
expect that the intensity surface on the two sides of the
step would in general be sloped as shown in Fig. 1, even if
the slope is slight. A simple analysis (Appendix I) of this
generalized step convolved with a gaussian shows that, in
cases where the two sides of the step do not have the same
slope, the localization based on zero-crossings would be
biased by (Δ.σ²_blur / step-size), where Δ is the difference
between the two slopes. On more
than one occasion, authors have suggested convolving the
image with a gaussian before finding the zero-crossings. It
can be shown that this would effectively amount to having
a blurring function with a variance equal to the sum of the
two variances and hence degrade localization.
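
The bias can be checked numerically. The sketch below is illustrative only: it blurs the generalized step of Appendix I with a unit-area gaussian in closed form, estimates the second derivative by central differences, and locates its zero-crossing by bisection. The parameter values are arbitrary, the function names are hypothetical, and the sign convention Δ = k2 - k1 is an assumption made here for the comparison.

```python
import math

def blurred_step(x, k1, k2, s, sigma):
    """Generalized step blurred by a unit-area gaussian (closed form).

    Before blurring, E(x) = k1*x for x < 0 and k2*x + s for x > 0.
    """
    t = x / sigma
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))         # normal CDF
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)  # normal density
    return k1 * x + (k2 - k1) * (x * cdf + sigma * pdf) + s * cdf

def second_derivative(k1, k2, s, sigma, x, h=1e-3):
    """Central-difference estimate of the second derivative at x."""
    f = lambda u: blurred_step(u, k1, k2, s, sigma)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def zero_crossing(k1, k2, s, sigma, lo=-1.0, hi=1.0):
    """Locate the zero-crossing of the second derivative by bisection."""
    g = lambda u: second_derivative(k1, k2, s, sigma, u)
    if g(lo) * g(hi) > 0:
        raise ValueError("no sign change in the search interval")
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

k1, k2, s, sigma = 0.0, 2.0, 10.0, 1.0      # illustrative values only
measured = zero_crossing(k1, k2, s, sigma)
predicted = (k2 - k1) * sigma ** 2 / s      # bias formula, with delta = k2 - k1
print(measured, predicted)
```

With equal slopes on both sides the zero-crossing falls at the step itself; the offset grows with the slope difference and the blur variance, and shrinks with the step-size.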


IV.  The  Details 

A variant of the surface-fitting approach is used here.
This fits a surface to the intensity-levels of the pixels and
was popularized by Haralick [Hara]. However, he has used
a cubic surface, which is an inadequate basis for most edges
and has a typical localization error of a few pixels. It
further presents the problem of multiple detection.

Recently, Canny [Cann] has applied an interesting
functional approach to edge-detection. The criteria are
inherently justifiable and yet there is no reason why one
should use a linear filter to meet them. Our view on the
problem of multiple detection has been stated in section
I. We believe that sub-pixel localization makes the problem
redundant. Further, most authors have ignored the
fact that the image consists of samples of the true intensity
profile blurred by the imaging system. The standard
deviation of this gaussian blurring function can be determined
by examination of the image of a point source or
step-edge. As a result of this blur, we have an image with
no underlying discontinuities. The spectrum is bandlimited
and aliasing avoided.

The noise is generally assumed to be additive gaussian.
Some authors have tried to reduce this by convolving
the image with a gaussian. As discussed in the previous
section, this deteriorates the localization. If one could
find the direction of the edge in a reliable fashion, then
the noise could be reduced by smoothing the window in
a direction parallel to the edge. This, of course, relies on
the fact that the window is small enough to justify the
presence of edgels. We have achieved this smoothing by
fitting to each window a one-dimensional surface, i.e. a
surface which varies only in one direction as shown in Fig.
2. This direction would be perpendicular to that of
the edgel in the window, if any.

So, now we come to the question of a reliable direction-finder
for windows hypothesized to contain edgels. A first
approximation for the direction can be obtained by fitting
a planar surface to the window. This was found to have
σ_direction < 10° for edges with (step-size / σ_noise) = 2. A
more general method can be used to refine this first estimate.
We fit a one-dimensional cubic surface to the nearest
5° and choose the one with the minimum square-error (the
corresponding equation is stated in the appendix). Starting
with the initial estimate, this search is generally not
more than a few steps. It is important to point out that
the plot of the square-error vs angle is bowl-shaped and
centered around the true angle, for windows containing
edgels. Hence, once within the bowl, standard methods
like steepest-descent can be used to find the minimum.
For edges with the step size twice the standard deviation
of the noise the variance is approximately 5°.

Fig. 2. One-Dimensional Surface
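
The first stage of the direction-finder can be sketched as follows. This is a minimal illustration, not the implementation used for the results in this paper: it fits a least-squares plane to a square window with coordinates centered on the window (so the normal equations decouple) and returns the gradient direction. The helper `synthetic_edge_window`, the step-size, and the base intensity are hypothetical test scaffolding; the cubic refinement to the nearest 5° is not shown.

```python
import math

def planar_direction(window):
    """Gradient direction, in degrees, of the least-squares plane
    a0 + ax*x + ay*y fitted to a square window of odd size.

    With x and y measured from the window center, sum(x) = sum(y) =
    sum(x*y) = 0, so ax = sum(x*I)/sum(x^2) and ay = sum(y*I)/sum(y^2).
    """
    half = len(window) // 2
    sxx = sxi = syi = 0.0
    for r, row in enumerate(window):
        for c, value in enumerate(row):
            x, y = c - half, r - half
            sxx += x * x
            sxi += x * value
            syi += y * value
    ax = sxi / sxx          # sum(y^2) equals sum(x^2) for a square window
    ay = syi / sxx
    return math.degrees(math.atan2(ay, ax))

def synthetic_edge_window(angle_deg, size=5, f=0.86 / 0.6, step=10.0, base=100.0):
    """Noise-free tanh step whose cross-section lies along angle_deg."""
    half = size // 2
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[base + step * math.tanh(f * (x * c + y * s))
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]

estimate = planar_direction(synthetic_edge_window(30.0))
print(estimate)   # a coarse estimate, expected within about 10 degrees of 30
```

Even without noise the planar estimate is only approximate for oblique edges on a 5 x 5 grid, which is why a refinement step follows it.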

It should be pointed out that there cannot be any
one unique basis which can be used for all the windows.
If we attempt to do this, we will obtain incorrect results
when the basis is inadequate and noise sensitive results
if the basis is not minimal. Similar considerations for
one-dimensional steps have been investigated by Leclerc [Lecl].

Now, the choice of an adequate basis. For most step-edges
the tanh function will be adequate. One important
by-product of employing the tanh is the contrast of the
edge. From our case studies it seems that this would be
helpful not only in linking, but also in interpretation. However,
for edges which do not have similar slopes on both
sides of the step, the tanh is inadequate and a cubic or a
tanh with a cubic might be adequate. The latter has some
problems because the tanh and cubic are not completely
independent. It may be desirable to employ splines when
the tanh and the cubic are inadequate bases. It should be
repeated that the cubic is inadequate for most step edges
and that the derivative is not a very noise-robust operator.
As a result, even if one does use a cubic, it is preferable to
localize and obtain the contrast from the tanh fit. We have
used a cubic for our detector because our window is too
small (5 x 5) for finding the parameters of a tanh with
a cubic or of splines, in the case of horizontal and vertical
edges. For roof-edges and line-edges, combinations of
the tanh function, as depicted in Fig. 3, seem to be adequate.
We compare the quadratic fit with the tanh fit
to determine the existence of a step-edgel. In the initial
stages, we had used the Chi-Square statistics to determine
the adequacy of the basis. It was found that this was
unnecessary and perhaps undesirable because edges of high
contrast seemed to be more noisy.


Fig. 3. Adequate Bases (panels: Step-Edge, Roof-Edge, Line-Edge)


Some authors have tried to compensate for an inaccurate
model, i.e. a mathematical step for a step-edge, by
using different scales. Different scales, we believe, are
unnecessary for most scenes completely within the depth-of-field
of the camera, unless some of the shadows are particularly
diffuse.

We have carried out our initial investigation for step-edges,
with images having a single scale, i.e. all objects are
within the depth-of-field of the camera. The blurring function
for our camera was determined to have a standard deviation
of 0.6 pixels. Numerically, it was determined that
for a standard deviation of 1 and a mathematical step, the
optimum scaling factor for the argument of the tanh function
was 0.86. This factor was determined by minimizing
the square-error. Hence, a rule of thumb for the scaling
factor is (0.86 / σ_blur). The detection scheme
is not particularly sensitive to this factor and in fact it
detects reasonably diffuse shadows.
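
A numerical determination of this kind can be sketched as below. The sketch is illustrative and makes its own assumptions: the blurred mathematical step is taken as erf(x / (σ√2)), the error is summed over a dense grid on [-4, 4], and the scale is searched in steps of 0.01. The exact optimum depends on the sampling and the extent of the window, so the result should land near, not necessarily exactly on, the 0.86 quoted above.

```python
import math

def squared_error(f, sigma, xs):
    """Squared error between a blurred unit step and tanh(f*x) on samples xs.

    A step from -1 to +1 blurred by a unit-area gaussian of standard
    deviation sigma is erf(x / (sigma * sqrt(2))).
    """
    return sum((math.erf(x / (sigma * math.sqrt(2.0))) - math.tanh(f * x)) ** 2
               for x in xs)

def best_tanh_scale(sigma=1.0, half_width=4.0, dx=0.05):
    """Grid-search the tanh argument scale that minimizes the squared error."""
    n = int(round(half_width / dx))
    xs = [i * dx for i in range(-n, n + 1)]
    candidates = [i / 100.0 for i in range(50, 151)]   # f in [0.50, 1.50]
    return min(candidates, key=lambda f: squared_error(f, sigma, xs))

f_unit = best_tanh_scale(sigma=1.0)
print(f_unit)           # optimum scale for a blur of standard deviation 1
print(f_unit / 0.6)     # rule-of-thumb scale for a blur of 0.6 pixels
```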

The window size is determined by the standard deviation
of the blurring gaussian. As the window size is
increased for a fixed blur, we trade off resolution for
localization. Resolution refers to the minimum support
required for the detection of an edgel, i.e. if an edgel can
be located within any window without the simultaneous
presence of any other edgel, it is theoretically resolvable.
If it is not detected, it would be due to the inadequacy
of the edge-detector. If three parallel edges are two pixels
apart, then the middle edge would not be resolvable, but
the other two might be. We will point out examples of
these in our first case study.

V. The Algorithm for Step-Edgel Detection

(i) Fit a planar surface to the window, minimizing the
square-error, in order to get an estimate of the edgel's
directionality, if one exists.

(ii) Refine the estimate, using a one-dimensional cubic
surface with the same error criterion. The resulting
equations are non-linear in the angle. However, owing
to the reliable initial estimate, this is typically a
couple of search steps. We find the angle to the nearest
5°. Calculate the 2, 20 F-Statistic for the planar
and cubic fits. If it exceeds the 75% threshold, then
declare the absence of an edgel.

(iii) Fit an optimal tanh 1-D surface in the direction found
in (ii). The tanh is localized to the nearest 0.1 pixel.

(iv) Fit a quadratic over the window and compare with the
tanh-fit. If the square-error in this case is less, i.e. the
21, 21 F-Statistic is less than the 50% threshold, then
declare the absence of an edgel.

(v) The step-size is determined by the coefficient of the
tanh-fit and the position, by its origin. The intensity
values on the two sides of the step can be found from
the constant in the basis.

(vi) Repeat for all the pixels, by shifting the window in
steps of one in the x and y directions.

N.B. All the relevant equations and statistics are listed in
Appendix II. If one wants to detect step-edgels
which have significantly different slopes on the two
sides of the edgel, steps similar to (iii) and (iv), but
with a basis different than the tanh, will have to be
added.
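
Steps (iii) and (v) can be sketched as follows, under stated assumptions. The function `tanh_fit` is a hypothetical illustration rather than the code used for the case studies: it takes the direction as already known, searches the position p on a 0.1-pixel grid, and for each p solves the linear least-squares problem for the coefficient s and the constant k of s.tanh(f.[z + p]) + k. The search range of ±2 pixels and the test values are arbitrary.

```python
import math

def tanh_fit(window, angle_deg, f):
    """Fit s*tanh(f*(z + p)) + k over a square window of odd size,
    where z = x*cos(b) + y*sin(b) and b = angle_deg.

    Returns (square_error, s, p, k). Twice s is the edge contrast; the
    origin of the tanh lies at z = -p.
    """
    half = len(window) // 2
    c = math.cos(math.radians(angle_deg))
    sn = math.sin(math.radians(angle_deg))
    zs, vals = [], []
    for r, row in enumerate(window):
        for col, value in enumerate(row):
            zs.append((col - half) * c + (r - half) * sn)
            vals.append(value)
    n = len(vals)
    mean_v = sum(vals) / n
    best = None
    for i in range(-20, 21):                  # p from -2.0 to 2.0 pixels
        p = i / 10.0
        t = [math.tanh(f * (z + p)) for z in zs]
        mean_t = sum(t) / n
        stt = sum((ti - mean_t) ** 2 for ti in t)
        if stt == 0.0:
            continue                          # degenerate: constant basis
        s = sum((ti - mean_t) * (v - mean_v) for ti, v in zip(t, vals)) / stt
        k = mean_v - s * mean_t
        err = sum((v - (s * ti + k)) ** 2 for ti, v in zip(t, vals))
        if best is None or err < best[0]:
            best = (err, s, p, k)
    return best

# Noise-free 5 x 5 test window with known parameters (illustrative values).
F, ANGLE = 0.86 / 0.6, 30.0
ca, sa = math.cos(math.radians(ANGLE)), math.sin(math.radians(ANGLE))
window = [[50.0 + 10.0 * math.tanh(F * (x * ca + y * sa + 0.3))
           for x in range(-2, 3)] for y in range(-2, 3)]
err, s, p, k = tanh_fit(window, ANGLE, F)
print(round(s, 3), round(p, 1), round(k, 3), "contrast:", round(2 * s, 3))
```

Because s and k enter the model linearly, only the position needs a search, which keeps the per-window cost small.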


VI.  Three  Case-Studies 

The proof of the pudding is in the eating. It is essential
to point out some details concerning the photographs
displayed. a) Only the step-edgel detector has been implemented.
b) The pictures with the edges are displayed
on twice the frame size of the pictures with only the original
image, for reasons mentioned in the introduction. c)
These pictures have the intensity of the edges proportional
to their contrast. d) The edgels displayed in all cases have
been thresholded to suppress contrasts less than twice the
standard deviation of the noise. e) All edges displayed are
composed of the raw edgels with no post-processing, like
linking, thinning, cleaning etc. f) When comparing the
performance of our approach to that of its predecessors, it
would be desirable to note the size of the original image.
g) The images with the edges superposed are formed by
enlarging the original image to four times its area and by
making the edge pixels have the lowest gray level. It is
instructive to refer to the original image, the edgel image
and the superposed image simultaneously to scrutinize the
performance of the edge detector. h) While doing so, it is
important to bear in mind that the declaration of false positives
and false negatives should be based on the original
image and should not be influenced by our preconceived
notions of regular and suggestive shapes.

(i) Industrial Setting : Bin of Parts

Size of the Original Image : 128 x 128

Standard Deviation of the Blur : 0.6
Standard Deviation of the Noise : 5

Fig. 4-a : The Original Image

Fig. 4-b : The Edgel-Image

The pins of the various parts have false negatives.
This is because they are bounded by dark lines,
and our edge detector has currently been implemented
only for step edgels. The outer edges of
the lines have been detected, but the inner edge
exceeds the resolution capabilities of our detector.
Also, notice that some of the circular regions
detected have a diameter of just a few pixels.

Fig. 4-c : The Superposed Image

Notice the localization.


(ii) Aerial View : San Francisco Bay

Size of the Original Image : 256 x 256

Standard Deviation of the Blur : 0.6
Standard Deviation of the Noise : 5

Fig. 5-a : The Original Image

This picture was chosen because of its complexity.

Fig. 5-b : The Edgel Image

At first glance, it may seem to a casual observer
that there are a host of false positives. However,
on closer examination this seems to be
untrue. The long lines in the sea correspond to
silt lines. It probably will not be possible to confirm
these in the photographs you will see. Note
the continuity in most edges. It is unlikely to
have long continuous false positives. Further, notice
the bridge in the upper left corner.

Fig. 5-c : The Superposed Image

Now, note the localization and the structure imposed
by the edgels. Compare the intensity on
the two sides of each edgel. Use the information
provided by all three images, once again, while
scrutinizing the performance.


(iii) Indoor Scene : A Telephone, a Cup and a Pencil.

Size of the Original Image : 256 x 256

Standard Deviation of the Blur : 0.6
Standard Deviation of the Noise : 4

Fig. 6-a : The Original Image

This image was chosen to illustrate our case
against the misuse of scale. Note the top surface
of the telephone. It does not correspond to
a mathematical step, but to a generalized step,
which most detectors would find by using different
scales. Also note the top edge of the book
on the left of the table and the top edge of the
pencil.

Fig. 6-b : The Edgel Image using Tanh-fit

Note the missing edges, which were pointed out in
the preceding paragraph. These can be obtained
using a larger scale, but we feel that this is not
only avoidable, but also unnecessary. The inner
portion of the flower on the cup has a few false
negatives due to lack of resolution.

Fig. 6-c : The Edgel Image using Tanh / Cubic

Notice the detection of the missing edges. However,
now we have multiple detection, i.e. the
edges are thicker than they would be without the
cubic, at some places.

Fig. 6-d : The Superposed Image (using Fig. 6-c)

Notice the localization, once again.

VII.  Conclusion 

This paper attempted the problem of edge-detection
using one-dimensional surfaces. Edges were defined in
terms of short, linear segments, called edgels. Detection
of edgels seems to be more appropriate than of edge-pixels
because edgels not only directly capture the characteristics
of edges, but they also establish continuity. An adequate
basis for most step-edgels was claimed to be the tanh.
Robustness to noise, sub-pixel position localization and 5°
angle localization were also claimed. A case against derivative
operators and the unnecessary use of multiple scales
was presented. Some case studies, with the algorithm for
step-edgels implemented for a single scale, were then
examined.

Before we conclude, we would like to make two interesting
observations. The blur, caused by optical aberrations
and the scanning mechanism, is, to a certain extent,
needed to avoid undersampling. In C.C.D. cameras,
sharp focussing produces aliasing. This is because the
image in these cameras is scanned by rectangular
non-overlapping windows. In such situations, slight de-focussing,
i.e. just enough to avoid aliasing, is desirable!

Edges in different orientations are not equal. This is
because of the use of a rectangular grid. Edges at different
angles have varying localization capabilities. The
reason is not hard to see. Edges at angles close to 0° and
90° have samples at a minimal number of points across
the edge. The edges at 45° also exhibit coincidental alignment;
however, this is much less than the former. Another
way to look at it is that, if we were trying to determine
the parameters of an adequate basis, edgels at 0° and 90°
would require the largest minimal windows and edgels at
45° would come next. It suggests that, for everyday scenes
with most edges in the horizontal and vertical directions,
it is advisable to tilt the camera!
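
The effect of orientation can be made concrete by counting how many distinct positions across the edge are sampled by a window. The following minimal sketch (the function name and the rounding tolerance are assumptions made for illustration) projects every pixel of a 5 x 5 window onto the edgel cross-section z = x.cos(b) + y.sin(b) and counts the distinct values of z.

```python
import math

def distinct_cross_section_samples(angle_deg, size=5):
    """Number of distinct positions z = x*cos(b) + y*sin(b) sampled by a
    size x size window, for an edgel cross-section at angle b = angle_deg."""
    half = size // 2
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    # Round so that positions equal up to floating-point error coincide.
    zs = {round(x * c + y * s, 9)
          for x in range(-half, half + 1)
          for y in range(-half, half + 1)}
    return len(zs)

for angle in (0, 90, 45, 30):
    print(angle, distinct_cross_section_samples(angle))
```

For a 5 x 5 window this gives 5 distinct positions at 0° and 90°, 9 at 45°, and 25 at 30°, which matches the ordering described above: the fewer distinct positions, the larger the window needed to determine the parameters of a given basis.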

As noted by Binford [Binf], current edge-detection
schemes have primitive capabilities, at best, and nature has
enough variety to find out a kludge and make us pay for
it. This paper would consider its purpose served if it has
succeeded in highlighting some of the issues and concerns
in edge-detection. It does not serve any purpose, at this
juncture, to talk about the computational requirements of
this algorithm, not only because it is yet to be considered
from an efficiency point of view, but more so because we
will finally have to resort to parallel and more efficient
architectures. This algorithm has a natural extension for
roof and line edges, which is yet to be implemented. Work
also needs to be done on an adequate basis for step-edgels
which have large deviations from mathematical steps.

Appendix I : Zero-Crossing Bias

Let the step-edge be E(x) and the Gaussian be G(x).

\[
E(x) = \begin{cases} k_1 x & \text{if } x < 0 \\ k_2 x + s & \text{if } x > 0 \end{cases}
\qquad
G(x) = e^{-x^2 / 2\sigma^2}
\]

It can be shown that (E(x) * G(x))'' = E(x) * G''(x).

\[
E(x) * G''(x) = \int_{x}^{\infty} k_1 (x - u)\, G''(u)\, du
              + \int_{-\infty}^{x} \left[ k_2 (x - u) + s \right] G''(u)\, du
\]
\[
= \left[ s - (k_2 - k_1)\, x \right] G'(x) + (k_2 - k_1) \left[ x\, G'(x) + G(x) \right]
\]
\[
= s\, G'(x) + (k_2 - k_1)\, G(x)
\]

Equating this to zero, and using G'(x) = -(x/\sigma^2) G(x), we get
x = \Delta \sigma^2 / s, where \Delta = (k_2 - k_1) and s is the step-size.
This is the biased zero-crossing of the second derivative.


Appendix II : Least-Squares Criteria and Statistics

Least-Squares Criterion for a Planar-fit :

\[ \mathrm{EtaP} = \sum_{x,y} \left( \mathrm{Image}[x,y] - (a_0 + a_x x + a_y y) \right)^2 \]
(minimize w.r.t. a_0, a_x and a_y; the sum runs over the window)

Initial Estimate of b : b_0 = \arctan(a_y / a_x)

Least-Squares Criterion for a Quadratic-fit :

\[ \mathrm{EtaQ} = \sum_{x,y} \left( \mathrm{Image}[x,y] - (a_0 + a_1 z + a_2 z^2) \right)^2 \]
(minimize w.r.t. a_0, a_1 and a_2)

\[ z = x \cos(b) + y \sin(b) \]

b is determined from the L.S.E. cubic-fit and is
the angle by which the axes have to be rotated to
align the x-axis with the edgel cross-section.

Least-Squares Criterion for a Cubic-fit :

\[ \mathrm{EtaC} = \sum_{x,y} \left( \mathrm{Image}[x,y] - (a_0 + a_1 z + a_2 z^2 + a_3 z^3) \right)^2 \]
(minimize w.r.t. a_0, a_1, a_2 and a_3)

\[ z = x \cos(b) + y \sin(b) \]

The initial estimate of b, from the L.S.E. planar-fit,
is refined. The equations to be solved are
non-linear in b.

Least-Squares Criterion for a Tanh-fit :

\[ \mathrm{EtaT} = \sum_{x,y} \left( \mathrm{Image}[x,y] - (s \tanh(f [z + p]) + k) \right)^2 \]
(minimize w.r.t. s, p and k)

f is determined from the rule of thumb mentioned
in section IV, i.e. (0.86 / σ_blur). Twice
s is the edge contrast and p is the position.

Statistics :

EtaP follows the Chi-Square Statistic with 22 D.O.F.
EtaQ follows the Chi-Square Statistic with approximately
21 D.O.F.
EtaC follows the Chi-Square Statistic with 20 D.O.F.
EtaT follows the Chi-Square Statistic with approximately
21 D.O.F.

{10 . (EtaP - EtaC) / EtaC} follows the 2, 20 F-Statistic.

{EtaQ / EtaT} follows the 21, 21 F-Statistic.
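
The factor of 10 and the degrees of freedom of the two ratios can be traced as follows. This is a sketch of the reasoning only, assuming that the residual sums behave as chi-square variables with the stated degrees of freedom (22 for the planar fit, 20 for the cubic fit, and approximately 21 each for the quadratic and tanh fits over a 5 x 5 window) and that the usual independence conditions for an F-ratio hold at least approximately.

```latex
\[
F_{2,20}
  = \frac{(\mathrm{EtaP} - \mathrm{EtaC}) / (22 - 20)}{\mathrm{EtaC} / 20}
  = 10 \, \frac{\mathrm{EtaP} - \mathrm{EtaC}}{\mathrm{EtaC}},
\qquad
F_{21,21}
  = \frac{\mathrm{EtaQ} / 21}{\mathrm{EtaT} / 21}
  = \frac{\mathrm{EtaQ}}{\mathrm{EtaT}}
\]
```

The first ratio compares the reduction in square-error gained by the two extra degrees of freedom of the cubic against its residual; the second compares the quadratic and tanh residuals directly, since both fits leave the same number of degrees of freedom.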

Fig. 4-a. Bin of Parts : Original-Image

Fig. 4-b. Bin of Parts : Edgel-Image

Fig. 5-a. San Francisco Bay : Original-Image

Fig. 5-b. San Francisco Bay

Fig. 6-a. Indoor Scene : Original-Image

Fig. 6-c. Indoor Scene : Edgel-Image (tanh/cubic)
EXTRACTING STRAIGHT LINES

ABSTRACT

This  paper  presents  a  new  approach  to  the  extraction  of 
straight  lines  in  intensity  images.  Pixels  are  grouped  into  edge 
support regions of similar gradient orientation, and then the
structure  of  the  associated  intensity  surface  is  used  to  determine 
the  location  and  properties  of  the  edge.  The  resulting  regions 
and  extracted  parameters  form  two  separate  representations  of  a 
straight  line  segment,  pixel-based  and  symbolic,  that  can  be  used 
together  for  a  variety  of  purposes.  The  algorithm  appears  to 
be  far  more  effective  than  previous  techniques  for  two  key 
reasons: 1) the gradient orientation is used as the initial
organizing criterion in the extraction of straight lines, instead of
the  gradient  magnitude;  and  2)  data  directed  organization  of  the 
complete  context  of  a  straight  line  is  determined  prior  to  any 
local  decisions  about  participating  edge  points. 


1.0 INTRODUCTION

The organization of significant local intensity changes into
the more global abstractions called "lines" or "global intensity
boundaries" is an early, but important, step in the
transformation of the visual signal into useful intermediate
constructs for interpretation processes. Despite the large amount
of research appearing in the literature, effective extraction of
linear boundaries has remained a difficult problem in many
image domains. This paper has two goals: a) the
development of mechanisms for extracting straight lines from
complex images, including intensity discontinuities of arbitrarily
low contrast; and b) the construction of an intermediate
representation of edge/line information which allows high-level
mechanisms efficient access to relevant lines. A more detailed
presentation can be found in [BUR84].

We  contend  that  the  major  failings  of  line  extraction 
algorithms  are  twofold:  the  relegation  of  information  about  edge 
orientation  to  a  secondary  role  in  the  processing,  and  the  lack 
of  a  comprehensive  global  view  of  the  underlying  image 
structure  prior  to  making  local  decisions  about  edge  features. 


In  most  edge  and  line  extraction  algorithms,  the  magnitude  of 
the  intensity  change  has  been  used  in  some  manner  as  the 
dominant  measure  of  importance  of  the  local  edge.  It  is  our 
view  that  edge  orientation  carries  important  information  about 
the  spatial  extent  of  the  straight  line. 

The  technique  presented  here  was  motivated  by  a  need 
for  a  straight  line  extraction  method  which  can  find  straight 
lines in reasonably complex images, particularly those lines that
are  long  but  not  necessarily  of  high  contrast.  A  key 
characteristic  of  the  approach  presented  here  that  distinguishes  it 
from  most  previous  work  is  the  global  organization  of  the 
supporting  edge  context  prior  to  any  decisions  about  the 
relevance  of  local  intensity  changes.  An  estimate  of  the  local 
gradient  orientation  at  each  pixel  is  the  basis  of  these  first 
organizing  processes.  Grouping  pixels  into  edge  support  regions 
avoids  the  plethora  of  magnitude  responses  from  masks  at 
varying  sizes  and  orientations,  as  well  as  unnecessary  complexity 
in  the  subsequent  organizing  mechanisms.  The  approach 
presented  here  has  its  roots  in  the  "gradient  collection"  process 
of Hanson et al. [HAN80], as well as [rjfR78].

Our  approach  allows  the  extraction  of  straight  lines 
despite  weaknesses  in  line  clarity  due  to  local  edge 
inconsistencies  or  deficiencies  in  width  and  contrast.  It  directly 
addresses  the  problems  associated  with  the  size  of  the  edge 
operators  and  determines  the  extent  of  support  to  be  given  to 
edges  and  lines  directly  from  the  underlying  data. 

2.0 A REPRESENTATION AND PROCESS FOR
EXTRACTING STRAIGHT LINES

2.1  Overview 

There  are  four  basic  steps  in  extracting  straight  lines: 

1.  Group  pixels  into  edge-support  regions  based  on 
similarity  of  gradient  orientation.  This  allows  data 
directed  organization  of  edge  contexts  without 
commitment  to  masks  of  a  particular  size. 

2.  Approximate  the  intensity  surface  by  a  weighted
planar  fit.  The  fit  is  weighted  by  the  gradient 
magnitude  associated  with  the  pixels  so  that 
intensities  in  the  steepest  part  of  the  edge  will 
dominate. 


3.  Extract  attributes  from  the  edge-support  region  and 
the  plane  fit.  The  attributes  extracted  include 
the  representative  line  and  its  length,  contrast, 
width,  location,  orientation,  and  straightness. 

4.  Filter  on  the  attributes  to  isolate  various  image 
events  such  as  long  straight  lines;  high  contrast 
short  lines  (heavy  texture);  low  contrast  short  lines 
(light  texture);  and  lines  at  particular  orientations 
and  positions.

2.2 Grouping Pixels into Edge Support Regions Via
Gradient Orientation

Figure  1  shows  two  representative  images  used  to 
illustrate  the  process.  Figure  2(a)  is  a  surface  plot  of  a  32x32 
subimage  from  another  house  image;  results  for  the  full  images 
are  shown  in  subsequent  sections.  The  vector  field  drawn  in 
figure  2(b)  shows  the  corresponding  gradient  image  where  the 
length  of  the  vector  encodes  gradient  magnitude.  The  gradient 
estimates  were  formed  by  convolving  the  image  with  two-by-two 
edge  masks  (figure  2(b)  inset).  Note  that  the  sign  of  the 
gradient  is  relevant. 
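
A minimal sketch of the gradient estimate follows. The exact two-by-two masks appear only in the figure inset, so the masks used here (column difference and row difference averaged over the 2x2 block) are an assumption, as are the function names. The orientation is kept over the full 360 degree range so that the sign of the gradient is preserved.

```python
import math


def gradient_2x2(image):
    """Estimate (gx, gy) on each 2x2 block of a grayscale image (list of rows).

    Assumed masks: gx = [[-1, 1], [-1, 1]] / 2, gy = [[-1, -1], [1, 1]] / 2.
    The output grids are one row and one column smaller than the input.
    """
    h, w = len(image), len(image[0])
    gx = [[0.0] * (w - 1) for _ in range(h - 1)]
    gy = [[0.0] * (w - 1) for _ in range(h - 1)]
    for y in range(h - 1):
        for x in range(w - 1):
            p00, p01 = image[y][x], image[y][x + 1]
            p10, p11 = image[y + 1][x], image[y + 1][x + 1]
            gx[y][x] = 0.5 * ((p01 + p11) - (p00 + p10))
            gy[y][x] = 0.5 * ((p10 + p11) - (p00 + p01))
    return gx, gy


def magnitude_and_orientation(gx, gy):
    """Return gradient magnitude and orientation in degrees, in [0, 360)."""
    mag = [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]
    ori = [
        [math.degrees(math.atan2(b, a)) % 360.0 for a, b in zip(rx, ry)]
        for rx, ry in zip(gx, gy)
    ]
    return mag, ori
```

On a vertical step edge the only non-zero responses lie on the blocks straddling the step, with orientation 0 degrees for a dark-to-bright transition from left to right.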

An extremely simple process was employed to group the
local gradients into regions on the basis of the orientation
estimates. The 360 degree range of gradient directions is
arbitrarily partitioned into a small set of regular intervals, say
eight 45 degree partitions or sixteen 22.5 degree partitions. If
our conjectures about edge orientation are correct, then pixels
participating in the edge-support context of a straight line will
tend to be in the same edge orientation partitions and
adjacent pixels that are not part of a straight line will tend to
have different orientations. A simple connected components
algorithm can be used to form distinct region labels for groups
of adjacent pixels with the same orientation label (Figure 2(c)).
Note that the many small regions of very low gradient
magnitude that fragment Figure 2(c) could later be grouped
into a homogeneous region, rather than interpreted as edge
elements.
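
The partitioning and labeling steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: 4-connectivity is assumed (the text does not say which connectivity is used), and the names `orientation_labels` and `connected_regions` are hypothetical. The `offset_deg` parameter allows a rotated set of partitions.

```python
from collections import deque


def orientation_labels(orientation, n_partitions=8, offset_deg=0.0):
    """Quantize orientations (degrees) into n_partitions equal buckets."""
    width = 360.0 / n_partitions
    return [
        [int(((a - offset_deg) % 360.0) // width) % n_partitions for a in row]
        for row in orientation
    ]


def connected_regions(labels):
    """Label 4-connected groups of pixels that share an orientation label.

    Returns (region, count): a grid of region ids and the number of regions.
    """
    h, w = len(labels), len(labels[0])
    region = [[-1] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if region[sy][sx] != -1:
                continue
            region[sy][sx] = next_id
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < h
                        and 0 <= nx < w
                        and region[ny][nx] == -1
                        and labels[ny][nx] == labels[y][x]
                    ):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region, next_id
```

Calling `orientation_labels` a second time with `offset_deg=22.5` gives a segmentation whose 45 degree partitions are rotated by half a partition.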

To make the fixed partition technique more sensitive to
edges of any orientation, the current algorithm uses two
overlapping sets of partitions, with one set rotated a
half-partition interval. Thus, if a 45 degree partition is used
starting at 0 degrees, then a second set of 45 degree partitions
starting at 22.5 degrees is also used. The critical problem of this
approach is merging the two representations in such a way that
a single edge is principally associated with a single gradient
region. The following scheme is used to select such regions for
each edge: first, the lengths are determined for the regions;
then, since each pixel is a member of exactly two regions (one
in each segmentation), the pixel votes for the one that provides
the longer interpretation; finally, each region counts the number
of pixels within its boundaries that voted for it as opposed to
regions of the other segmentation. The support a region is
given is the number of votes for it over the total number of
pixels in the region. The regions selected are those that have a
majority, i.e., the support is greater than 50%. For further
discussion of grouping see [BUR84].
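
The voting scheme can be sketched as below. The data layout (two grids of region ids plus a length per region) and the function name `select_regions` are assumptions made for illustration, and ties between equally long regions are broken in favour of the first segmentation, which the text does not specify.

```python
from collections import Counter


def select_regions(regions_a, regions_b, length_a, length_b):
    """Select regions whose support exceeds 50%.

    regions_a, regions_b: grids of region ids, one per segmentation.
    length_a, length_b: dicts mapping region id to the length of its line.
    Returns (keep_a, keep_b), the sets of selected region ids.
    """
    size_a, size_b = Counter(), Counter()
    votes_a, votes_b = Counter(), Counter()
    for row_a, row_b in zip(regions_a, regions_b):
        for ra, rb in zip(row_a, row_b):
            size_a[ra] += 1
            size_b[rb] += 1
            # Each pixel votes for the region giving the longer interpretation.
            if length_a[ra] >= length_b[rb]:
                votes_a[ra] += 1
            else:
                votes_b[rb] += 1
    keep_a = {r for r in size_a if votes_a[r] / size_a[r] > 0.5}
    keep_b = {r for r in size_b if votes_b[r] / size_b[r] > 0.5}
    return keep_a, keep_b
```

A region that receives exactly half of its pixels' votes is not selected, since the support must be strictly greater than 50%.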


2.3 Interpreting the Edge-Support Region as a
Straight Line

The underlying intensity surface of each gradient region
is a candidate for a straight line structure; the key problem is
to use this information to find the line. The positional
parameters extracted will serve as the core of the structure's
symbolic description as well as a coordinate system about which
other attributes will be measured.

In this section, we will examine a simple process for
computing the parameters of a planar fit to the intensity
surface of the pixels in each edge-support region. The region
depicted in figure 3(a) and as dots in the surface plot of figure
2(a) will serve as our example. Note that it includes pixels
outside the group of gradients depicted in figure 2(c), since the
two-by-two masks incorporated them in the gradient estimation.

Haralick [HAR81] modelled the local intensity surface in the
neighborhood of a pixel as a planar surface patch called a
'sloped facet'. This planar fit served as a model of the
region structure and was used to determine if the pixel was at
a region boundary or not. In our application the planar fit
will be applied to all pixels in a support region instead of an a
priori fixed geometric configuration. If a direct least-squares
planar fit to all pixels in the support region is computed, then
many pixels which might be at the tail of the intensity change
could dominate the fit. Therefore, the pixels were weighted
by local gradient magnitude to enhance the effect of points
near the edge center.

An obvious constraint on the orientation of the line is
that it be perpendicular to the gradient of the fitted plane.
This leaves the problem of locating the line along the
projection of the gradient. A simple approach is to intersect
the fitted plane with a horizontal plane representing the average
intensity of the region weighted by local gradient magnitude as
shown in Figure 3(b); the straight line resulting from the
intersection of the two planes is shown in Figure 3(a).
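
The weighted fit and the plane intersection can be sketched as follows, assuming each pixel is given as (x, y, intensity, weight) with the weight equal to its local gradient magnitude. The function names and the (nx, ny, rho) line representation are assumptions made for this illustration. Because the weighted least-squares plane passes through the weighted centroid of the data, the resulting line also passes through the weighted centroid of the pixel positions.

```python
import math


def _det3(a):
    return (
        a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
        - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
        + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0])
    )


def weighted_plane_fit(pixels):
    """Fit a0 + ax*x + ay*y to intensities, weighting each pixel by w.

    pixels: list of (x, y, intensity, w). Returns [a0, ax, ay].
    """
    m = [[0.0] * 3 for _ in range(3)]
    v = [0.0] * 3
    for x, y, val, w in pixels:
        basis = (1.0, x, y)
        for i in range(3):
            v[i] += w * basis[i] * val
            for j in range(3):
                m[i][j] += w * basis[i] * basis[j]
    d = _det3(m)  # Cramer's rule on the 3x3 normal equations
    return [
        _det3([[v[i] if j == k else m[i][j] for j in range(3)] for i in range(3)]) / d
        for k in range(3)
    ]


def line_from_plane(pixels, a0, ax, ay):
    """Intersect the fitted plane with the weighted mean intensity level.

    Returns ((nx, ny, rho), theta): points on the line satisfy
    nx*x + ny*y = rho, and theta is the line direction in radians.
    Returns None when the fitted plane is flat.
    """
    norm = math.hypot(ax, ay)
    if norm == 0.0:
        return None
    total_w = sum(p[3] for p in pixels)
    level = sum(p[2] * p[3] for p in pixels) / total_w
    rho = (level - a0) / norm
    theta = math.atan2(ax, -ay)  # direction (-ay, ax), perpendicular to gradient
    return (ax / norm, ay / norm, rho), theta
```

For an intensity ramp in x only, the extracted line is vertical and sits at the weighted mean x position of the region.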

2.4 Extracting Attributes of the Support Context

The gradient region and the planar fit of the associated
intensity surface provide the basic information necessary to
quantify a variety of attributes beyond the basic orientation and
position parameters. Length is simply the distance between the
two endpoints. Other attributes of the line include properties
of the intensity profile perpendicular to the line and its
behavior along the length of the line. Analysis of the profile
of the line can provide a measure of the edge's contrast and
width (fuzziness), while behavior along the length determines its
straightness; see [BUR84].

3.0  EXPERIMENTAL  RESULTS 

The  algorithm  described  in  the  preceding  sections  was 
applied  to  the  full  images  shown  in  Figure  1.  The  algorithm 
utilized  overlapping  partitions  as  described  in  Section  2.2;  the 
partition size was 45 degrees, staggered by 22.5 degrees.
Figures  4-5  demonstrate  the  performance  of  the  algorithm. 


Figure 4(a) shows the unfiltered output of the algorithm
applied to the first house image. Note that all of the small and
low contrast edges are still present. Figures 4(b-c) show the
result of filtering 4(a) on the basis of gradient steepness (change
in gray-levels per pixel) followed by a filtering on length that
separates the edges into two disjoint sets, one corresponding to
short texture edges (Figure 4b) and the other to longer lines
related to the surface structure of objects in the image (Figure
4c). We are also examining ways in which texture descriptors
may be constructed from the edge set remaining when a filter
similar to that which produced 4(c) is applied to the initial
edge data. In Figure 4c, the structural edges representing the
telephone wires were extracted from a thin, one pixel wide
diagonally oriented region, a difficult problem for many line
extraction processes.

Figure 5 shows results on another typical house image,
this time filtered on length alone (all edges greater than 5
pixels).
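
Filtering on attributes of this kind amounts to simple predicates over the extracted line records. The sketch below is illustrative only: the `Line` record, the function name and the steepness and length thresholds passed in are hypothetical, since the text gives only the 5 pixel length threshold used for Figure 5.

```python
from dataclasses import dataclass


@dataclass
class Line:
    length: float     # distance between the two endpoints, in pixels
    steepness: float  # change in gray levels per pixel across the edge


def split_by_length(lines, min_steepness, length_threshold):
    """Keep sufficiently steep lines, then split them into short and long sets."""
    steep = [ln for ln in lines if ln.steepness >= min_steepness]
    short = [ln for ln in steep if ln.length <= length_threshold]
    long_ = [ln for ln in steep if ln.length > length_threshold]
    return short, long_
```

Setting `min_steepness` to zero reduces this to a filter on length alone.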


4.0  CONCLUSIONS 

This  paper  has  presented  a  low-level  representation  for 
straight  lines.  The  technique  for  extracting  straight  lines  is 
effective  because  it  globally  organizes  the  spatial  extent  of  a 
straight  line  without  local  decisions  about  the  meaningfulness  of 
an  edge  feature.  It  does  this  by  utilizing  gradient  orientation 
to  provide  a  gradient  segmentation  of  the  pixels  in  the 
formation  of  edge-support  regions.  Analysis  of  the  intensity 
surface  of  the  pixels  in  these  regions  yields  the  information 
required  to  extract  lines  and  characterize  the  intensity  variations 
in  a  variety  of  ways.  The  algorithm  is  very  robust  and 
accurately  extracts  many  low  contrast  long  lines. 


While the extracted lines, such as long straight lines,
might be directly useful, the underlying edge-support region has
a wealth of information useful to intermediate processing
strategies. These include additional grouping mechanisms for
linking co-linear straight line segments and linking
piecewise-linear approximations to curved lines bounding areas of
similar properties. It also contains information useful for
grouping lines with common properties into textured regions.
The representation can serve as a simple yet rich edge-line
"primal sketch" [MAR77]. The edge-support regions might be
useful in separating the straight lines into intrinsic images
[BAR78] representing boundaries of different types such as
illumination, texture, reflectance, orientation, etc.
Figure 1. Two images used to demonstrate the straight line finder.

Figure 2. Forming gradient regions. (a) A
to  illustrate  the  process,  (b)  The  2x2  operato 
local  gradient  orientation  is  obtained  and 
formed  by  a  regular  partitioning  of  the  data 
ABSTRACT

Many objects are recognizable by outlines of their two
dimensional projections. If there are distinct features, the
position and orientation can be determined even when
there are substantial occlusions or additional touching
objects. Contour matching has been used for a variety of
tasks by other researchers. We present a simple algorithm
for matching linear representations of closed, or almost
closed, boundaries of objects. Arbitrary changes in
orientation and position are allowed along with occlusions.
Unlike many other methods, no relaxation (or iterative
updating of the match rating) is necessary. A complete
system which uses multiple resolution representations
(currently three) has been implemented and tested on a
variety of scenes. The results for the general matching
problem (determine if two contours can match and how to
transform them to best match) are very good. Further
work remains in using this method for identifying very
similar objects, but for distinct objects it currently works
very well.

1  INTRODUCTION 

Two dimensional projections of objects are sufficient for
many recognition tasks. In industrial automation
applications, many objects have only a few stable
positions. For sequences of images of dynamic scenes,
the 2-d projection of the objects does not change
significantly from one view to the next.

The outlines of the object often provide sufficient
information for recognition. With occlusions and missing
pieces, we can expect only a portion of the outline to
match in two views of an object. In this paper we will
concentrate on using closed outlines of objects for
matching, but the main ideas should apply to matching
with long curves which correspond to only a portion of an
object.

A variety of contexts have been used to study the
problems of matching closed boundaries. There is a
series of papers [1,2,3] which report on different
techniques for recognition and motion analysis. Chow and
Agarwal [1] used contour matching for studies of
simulated closed patterns using polygonal figures. McKee
and Agarwal [2] looked at matching with only partial views
of objects. They developed a measure to define how well
two curves match. Martin and Agarwal [3] used curved
boundaries in their studies of dynamic scenes. Davis [4]
and Davis and Henderson [5] explored the use of relaxation
matching for shape analysis. The first paper recognized
islands based on their outlines, and the second combined
relaxation and syntactic methods for recognition. Bhanu [6]
looked at contour matching of closed contours using a
relaxation technique on the piecewise linear representation
of the boundary curve. Ayache [7] has given a method for
accurately locating an object in a scene where there is
substantial occlusion (or additional metal on the object
- metal castings) using matches for key segments to force
the locations of other segments. Line segment matching,
without considering closed contours, has been studied by
Clark et al. [8] for aerial views and Medioni [9] for stereo
pairs. Here we have ignored the use of global boundary
descriptions since they tend to not work well with
occlusions and missing parts.

The matching method we present here is an attempt to
be more general in terms of the type of potential tasks
and computationally simpler than previous methods. The
previous work has shown that corresponding segments in
two views can be computed reliably and used for
recognition or object matching. Also, it has shown that
some correct corresponding segments can be located by
simple, rotation invariant, features of the segments, but
these features give many extra matches. The most
important consideration for matching closed contours is
that the order of segments in the two views must be the
same (or in strictly reverse order if mirror images are
allowed). This last property was important in [2] and [3]
and was also used in [4], [5], [6], and [7].

2.  ALGORITHM  DESCRIPTION 

Rotation invariant features of line segments include the
line segment length and the angle between consecutive
segments. Using only these features, a given line
segment in one view can readily match many segments in
another view. But a consecutive sequence of border
segments in one view should have few matches with
consecutive or monotonically increasing sequences in the
other view. Rather than comparing sequences of boundary
segments, we will compute the potential matches and look
for sequences.
2.1. Boundary Descriptions

The images are of a variety of small tools on a light
table for good contrast, but the clear handles of some of
the tools do not always have sufficient contrast between
the objects and the background. The objects are extracted
using a simple threshold to separate them from the bright
background. The boundary of the region is transformed
into a sequence of straight line segments by a procedure
originally developed for sequences of edge points [10].
Figure 1 shows the segment representation for two views
of the nine objects. The accuracy of the line segment
representation is controlled by a parameter which gives
the allowable deviation of the individual points from the
straight line segments. Three representations are
generally computed, corresponding to deviations of 1.2 (1
has an anomalous behavior), 2 and 4. Each boundary is
represented as an ordered sequence of these line
segments with length and orientation for each. Figure 2
shows the two lower resolution segment representations
for the image in Fig. 1a. Segments are numbered in order
around the boundary, thus consecutive segment numbers
correspond to adjacent segments in the boundary. These
segment sequences are called families. Each family
corresponds to one region boundary, with separate families
for the interior boundaries of holes. Several families are
included in the description for an image, with each family
matched separately.

2.2. Initial Matching Segments

The matching procedure considers two families at a
time, one from each of two images. Each segment in the
first family is compared with each in the second family to
determine if they can possibly match. If they do match,
then the orientation difference is stored in an array
(indexed by segment numbers), called a disparity array.
Possible matches are determined by comparing the
segment lengths and by comparing the angles between
the current segments and their respective successors.
Both of these tests use a threshold chosen by the user.
The length threshold is a multiplicative factor greater than
1, thus we can use the test

$L(1)/t < L(2) < L(1) \cdot t$

where L(X) is the length of segment X and t is the
threshold. At this point the length restriction is severe
around  l  !  The  angle  difference  threshold  depends  on 
the  resolution  of  the  line  segment  representation,  ranging 
from  90°  for  the  lowest  resolution  version  to  45°  for  the 
highest.
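
The two tests and the disparity array can be sketched as follows. This is a sketch under assumptions: each family is taken to be a closed boundary stored as a list of (length, orientation in degrees) pairs, unmatched cells are marked with None, and the names `angle_diff` and `disparity_array` are hypothetical. The thresholds t and `angle_threshold` are left as parameters, since they are chosen by the user.

```python
def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def disparity_array(family1, family2, t, angle_threshold):
    """Build the disparity array for two families of segments.

    family1, family2: lists of (length, orientation_deg) in boundary order.
    Cell [i][j] holds the orientation difference when segment i of the first
    family can possibly match segment j of the second, otherwise None.
    """
    n1, n2 = len(family1), len(family2)
    disp = [[None] * n2 for _ in range(n1)]
    for i, (len1, ori1) in enumerate(family1):
        # Angle between this segment and its successor (closed boundary).
        succ1 = angle_diff(family1[(i + 1) % n1][1], ori1)
        for j, (len2, ori2) in enumerate(family2):
            succ2 = angle_diff(family2[(j + 1) % n2][1], ori2)
            length_ok = len1 / t < len2 < len1 * t
            angle_ok = abs(angle_diff(succ1, succ2)) < angle_threshold
            if length_ok and angle_ok:
                disp[i][j] = angle_diff(ori2, ori1)
    return disp
```

For a rectangle and a copy of it rotated by 30 degrees, each side matches both sides of equal length in the other view, with orientation differences of 30 and -150 degrees; only the first is consistent along a sequence of consecutive segments.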

Each segment will have many possible matches using
these two criteria, but there should be very few cases
where several consecutive segments in one image match
consecutive segments in the other, all with similar
orientation differences. Figure 3 is an example of the data
in the disparity array; values are not given, just the
locations with matches. Thus, we need to find long
consecutive sequences of matching segments. As the first
step, we find two matches, n with m and n+1 with m+1.
These two matches are used as the starting point for a
simple search to find a long sequence of matches where
gaps in either sequence are allowed.


2.3.  Initial  Matching  Sequence 

From the pair of initial matches we search for the next
(or previous by searching backwards) matching segments
using the disparity array. We look at the diagonally
adjacent segments next (increment both by one), then the
points off the diagonal. In the following


X is the first point located and Y the second. The search
starts at 1 then continues at 2, 3, etc.; after 9 we continue
at 0, 1. This search continues until another matching
pair is located which has an orientation difference close to
that of the first pair. The threshold for close is the same
as was used as a limit on the difference between the
angles of segments and their successors. Gaps between
the last match and the new match are filled in as matches
when the new match occurs along the diagonal, i.e., both
sequences skipped the same number of segments.
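
A simplified version of the forward search is sketched below. The exact visiting order of the off-diagonal cells is given by a diagram that is not reproduced in this text, so the order used here (smallest total skip first, diagonal cells preferred) is an assumption, as are the `max_skip` limit and the function names. The search wraps around the closed boundary but never advances a full turn in either family.

```python
def angle_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def grow_sequence(disp, start, close_deg, max_skip=3):
    """Grow a sequence of matches forward from a starting match.

    disp: disparity array with None for cells that have no match.
    start: (i, j) index of the first match.
    close_deg: how close an orientation difference must be to the first one.
    Returns the list of (i, j) pairs in the sequence, including filled gaps.
    """
    n1, n2 = len(disp), len(disp[0])
    steps = sorted(
        ((di, dj) for di in range(1, max_skip + 1) for dj in range(1, max_skip + 1)),
        key=lambda s: (s[0] + s[1], abs(s[0] - s[1])),
    )
    i, j = start
    reference = disp[i][j]
    sequence = [(i, j)]
    advanced_i = advanced_j = 0
    while True:
        for di, dj in steps:
            if advanced_i + di >= n1 or advanced_j + dj >= n2:
                continue  # would wrap past the starting segment
            ni, nj = (i + di) % n1, (j + dj) % n2
            value = disp[ni][nj]
            if value is None or abs(angle_diff(value, reference)) >= close_deg:
                continue
            if di == dj:
                # Both sequences skipped the same number: fill the gap.
                sequence.extend(((i + k) % n1, (j + k) % n2) for k in range(1, di))
            sequence.append((ni, nj))
            i, j = ni, nj
            advanced_i += di
            advanced_j += dj
            break
        else:
            return sequence
```

A backward search would follow the same pattern with negative steps.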
All possible sequences are located in the two families. If
the longest sequence has enough points (5 for cases
where good matches are expected, 3 as an extreme where
nothing is known) then this set of matching segment pairs
is used to determine the approximate transform to align
the two families. If no long sequence is found, then there
is no match for these two families. The transformation is
one which will perfectly align a pair of matching
segments. This pair is the one near the median
orientation difference (computed using the length of the
segments as weights), which is longest and where both
segments of the pair are close to the same length. That
is, starting at the median look for the longest segment
where the ratio between the segment lengths (short/long)
is greater than 0.8. The best two transformations are used
for the transformation refinement (along with the best
transformation from the second longest matching
sequence, if its length is close (0.7) to that of the longest
one). This transformation only applies for mapping
between the given pair of families; there will be many
such transformations in the final complete match.
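
One reading of the pair selection is sketched below. Several details are assumptions: the weight of a pair is taken to be the shorter of its two segment lengths, "near the median" is interpreted as within a hypothetical tolerance `near_deg`, and the orientation differences are assumed not to straddle the +/-180 degree wrap. The function names are also hypothetical.

```python
def weighted_median(values, weights):
    """Return the value at which the cumulative weight first reaches half."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def pick_alignment_pair(matches, near_deg=10.0, min_ratio=0.8):
    """Choose the matching pair used to define the transformation.

    matches: list of (length1, length2, orientation_difference_deg).
    Returns (median, best), where best is the longest pair whose length
    ratio (short/long) exceeds min_ratio and whose orientation difference
    lies within near_deg of the weighted median, or None if there is none.
    """
    median = weighted_median(
        [m[2] for m in matches], [min(m[0], m[1]) for m in matches]
    )
    best = None
    for len1, len2, diff in matches:
        ratio = min(len1, len2) / max(len1, len2)
        if ratio > min_ratio and abs(diff - median) <= near_deg:
            if best is None or min(len1, len2) > min(best[0], best[1]):
                best = (len1, len2, diff)
    return median, best
```

The selected pair then fixes a rotation (its orientation difference) and a translation that brings the two segments into alignment.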


2.4.  Transformation  Refinement 

Two (or possibly three) transformations for a pair of
families have been generated which must be compared to
select the best. With a known transformation, we use
different constraints for computing matching segments. A
possible match is indicated if the segments overlap in
position, or nearly overlap, and the orientations are similar
(90° for short segments and 20° for long ones), after
the transformation has been applied to the segment in the
first image. Figure 4 shows the matches for the
transformation generated by the longest and second
longest sequence from the data in Figure 3.
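
The position and orientation test can be sketched as follows. Segments are assumed to be stored as pairs of endpoints, and the transformation as a rotation angle plus a translation. The text does not define "nearly overlap" or the boundary between short and long segments, so the midpoint-to-segment distance test and the parameters `near` and `long_length` are assumptions, as are the function names.

```python
import math


def transform_segment(seg, angle_deg, tx, ty):
    """Rotate a segment about the origin by angle_deg, then translate it."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return tuple((c * x - s * y + tx, s * x + c * y + ty) for x, y in seg)


def point_segment_distance(p, seg):
    """Euclidean distance from point p to the closest point of a segment."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    denom = dx * dx + dy * dy
    u = 0.0
    if denom > 0.0:
        u = max(0.0, min(1.0, ((p[0] - x1) * dx + (p[1] - y1) * dy) / denom))
    return math.hypot(p[0] - (x1 + u * dx), p[1] - (y1 + u * dy))


def orientation(seg):
    """Orientation of a segment in degrees."""
    (x1, y1), (x2, y2) = seg
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def possible_match(seg1, seg2, transform, long_length=20.0, near=3.0):
    """Test seg1 (first image) against seg2 after applying the transformation.

    transform: (angle_deg, tx, ty). Returns (is_match, disparity), where
    disparity is the distance from the moved midpoint to seg2.
    """
    moved = transform_segment(seg1, *transform)
    length = math.hypot(moved[1][0] - moved[0][0], moved[1][1] - moved[0][1])
    limit = 20.0 if length >= long_length else 90.0
    diff = abs((orientation(moved) - orientation(seg2) + 180.0) % 360.0 - 180.0)
    mid = ((moved[0][0] + moved[1][0]) / 2.0, (moved[0][1] + moved[1][1]) / 2.0)
    disparity = point_segment_distance(mid, seg2)
    return (diff < limit and disparity <= near), disparity
```

With a correct transformation the returned disparity is close to zero for corresponding segments.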

Using each of the possible transformations, we compute
the set of initial matching segments and find the longest
sequence of matching segments by the same procedures
as above. A disparity value (Euclidean distance) is stored
in the disparity array and is used in the search for long
sequences. Since a transformation is applied to segments
in the first view, the disparities should all be near zero,
but the matching ideas here are more general and can be
used in a stereo problem where there is no orientation
transformation and similar disparities are used to separate
various possibilities. The search for long sequences
allows wrap around matches: if one sequence hits the
end of the sequence it can start over at the beginning
while the second only increments by one.

2.5. Hierarchical Matching

Multiple resolution segment representations help improve
speed and accuracy. The time for matching of two
families depends on the number of segments in the two
families, but the alignment is best when the segments
very closely follow the contours of the object. We apply
the two step matching procedure to the lowest resolution
representation and obtain a set of matches for many of
the families. At the next higher resolution we use the
known transformation as the starting point and apply the
transformation refinement operation twice. (The second
step primarily finds which segments match with the
updated transformation rather than a more accurate
transform and is primarily for display purposes.) Families
which had no match at the lower resolution are processed
the same as at the lowest resolution: find initial matches
using the length and angle with the successor, then refine
the match using position and orientation.

2.6. Matching Summary

In summary, the matching procedure can be described as
two passes of two processing steps applied to each pair
of families.

Pass 1, step 1: Compute likely corresponding segments by
comparing all segments with all others. Use
segment length and the angle between a segment
and its successor to determine the match.

Pass 1, step 2: Locate sequences of corresponding
segments where the segment number increases
monotonically in each image. Use these sequences
to determine a good transformation to map one set
into the other.

Pass 2, step 1: Using the transformation, compute a new
set of likely matching segments using segment
position and orientation.

Pass 2, step 2: Locate sequences of monotonically
increasing segments and determine a new
transformation.

For multiple resolution data, Pass 2 is repeated as Pass 3
and 4, to determine corresponding segments at the higher
resolution and yet a better transformation.

3  RESULTS 

This matching program is intended to be somewhat
general; it answers two questions: Can these two sets
of segments be forced to match? What transformation will
align the view in the first image with the second?
Because we wish the program to work even with
occlusions, the program will indicate a match when
presented with two partially similar objects. If the task is
recognition then an evaluation of the match quality is
necessary to determine which identification is best. In this
paper the results are for a basic matching task, not
specifically recognition, but we do evaluate the matches
and eliminate those which are much worse than others for
the same family, based on number of matches, total
disparity after transformation, total orientation differences,
and total successor angle difference.

The input images are of a set of tools (two pairs of
pliers, two small screwdrivers, one longer one with a
similar handle, one large screwdriver and one short, fat
one). A mechanical pencil and a fountain pen were also
included. These last two had fewer segments in the
representation and, in some cases, appeared as mirror
images and thus did not match as reliably. Two views of
all nine objects, with no occlusions, were taken, plus two
more views of a subset of the objects and six other views
with a variety of occlusions. The exact segment to
segment match is not important since some segments
only partially match; therefore, we will present the results
as outlines taken from the first images transformed to line
up with the objects in the second image.

Figure 1 shows the outlines of the two images with all objects
and no occlusions. These two images are matched with all the
others (including with each other) in our experiments. Even
though the images were digitized on a light table to obtain
near-perfect outlines, in some cases the clear handles of the
screwdrivers cause problems. Figures 5-10 show some of the
results, selected to show successes and problems.

3.1  Evaluation 

The program locates most of the correct matches, and many of the
extra matches are with very similar objects. The differences
between the two small screwdrivers are very minor, and the handle
of the long-bladed screwdriver is almost the same as those of the
two small ones, so these three often match all three
possibilities (see Figs. 5 and 6). In Fig. 5 there are three good
matches for each of the two small screwdrivers and two for each
of the larger ones. These are valid since the handles are very
similar. Both of the pliers in one image match with both in the
other since their handles are very similar. Many of the
'incorrect' matches can be eliminated by choosing only the best
match, but this also means some correct matches are missed. When
two similar objects occlude each other, the match for both may be
with the same sequence. In Fig. 7 both pliers match with one
sequence because this was the best match at the lowest
resolution. The same is true of the group of three screwdrivers,
where all matches are with only one of the objects. Figures 8 and
9 show that, in some cases, the overlap of similar objects (the
pliers) does not hurt the match. Round objects can cause
difficulties (even when there are small, well-defined 'ears'),
since many different rotations will give equally good matches. We
show no examples here, but we encountered this problem on an
earlier, similar data set and mention it as a known problem.

This matching procedure is reasonably efficient, with the total
time depending on a number of factors, primarily the number of
segments in the representation. For example, the matching for
image 1 (Fig. 1a) with image 2 (Fig. 1b) (see Fig. 5 for the
results) takes about 2 minutes 8 seconds. This includes matching
at three resolutions for 9 objects in each image. Approximately
75% of the time is spent computing the likely matches and 20%
searching for sequences of matches or computing the
transformation. The times for the higher resolutions are not
significantly greater than for the lowest resolution because the
matching is restricted to refining the existing matches, not
searching for new ones, except for the few unmatched objects. The
lowest-resolution match uses 116 and 119 segments from the first
and second view, respectively, and compares 9 families in one
view with all 9 in the second (i.e., it tests the match for 81
possible combinations). This requires a total of 41 seconds for
all 81 comparisons. The implementation is on a PDP-10 with no
special effort for low-level efficiency.

4.  CONCLUSIONS 

We have proposed a simple, relatively efficient matching
procedure for comparing contours in scenes containing occlusions
and multiple objects, which requires no iterative updating
(relaxation). This procedure uses the order of segments around a
boundary as the most important criterion for determining whether
individual segments match. There are still some open problems,
which are also problems for any other existing system. These
include how to evaluate several different matches for use in a
recognition system with a variety of similar objects. Other
problems are the uniform use of holes, which give multiple
segment sequences for one region, and the efficient handling of
almost circular objects.

Figure 1a. Outlines of the first view of all nine objects.


Figure 1b. Outlines of the second view of all nine objects.




Figure  2a  Segments  for  a  deviation  threshold  of  2 


Figure  2b  Segments  for  a  deviation  threshold  of  4 

Figure 4a. The longest sequence corresponds to the incorrect
pliers.

Figure 4b. The second longest sequence corresponds to the correct
pliers.


Figure 4. The matching segments for the two longest sequences
found in the data of Fig. 3. Transformations have been applied so
that the two views are aligned.


Figure 3. Matching segments for initial step. Each row
corresponds to a segment in the first view, each column to a
segment in the second view. The first view is for the pliers in
the lower right of Fig. 1a; the second view is in Fig. 8a.

Figure 5. Final matching results for the objects in Fig. 1a with
those in Fig. 1b. The outlines from the objects in Fig. 1a are
transformed to line up with the objects in Fig. 1b and the
displayed


Figure 7a. Outlines of the objects in test image 9.


Figure 6. Final matching results for matching Fig. 1b with
Fig. 1a.


Figure 8a. Outlines of the objects in test image 6.


Figure 7b. Final matching results for matching Fig. 1a with
Fig. 7a.


Figure 8b. Final matching results for matching Fig. 1a with
Fig. 8a.

ABSTRACT

A negative answer is provided to a question related to a proposal
of Binford for edge localization in position and angle
simultaneously by finding zero crossings of "stacked planes" of
directional convolutions. Mathematical background is provided,
leading to use of the inverse function theorem as the main tool
of proof.

INTRODUCTION 

One way to localize edges is by finding zero crossings of a
convolution operator. This method yields a precise value for,
say, the x-coordinate, but to determine the orientation of the
edge requires further processing, e.g. using a number of oriented
operators (which may disagree as to the x-position) or by
observing the locus of zero crossings. An integrated method of
extracting the position and orientation would be preferable.

[Binford 1981] proposes localizing edges in direction and
orientation simultaneously by convolving a lateral inhibition
signal with a directional operator and viewing the results as a
set of "stacked planes," one for each orientation of the
operator. The estimate for the edge would be based on finding
maxima of the gradient of the lateral inhibition signal with
respect to position and angle, by seeking zero crossings of the
partial derivatives. Since all of the operations prior to finding
zeroes would be implemented as convolutions, it is the zero
crossings of the resulting convolutions which are sought. As
ultimately stated in [Binford 1981], 2 convolutions,
corresponding to 2 partial derivatives, must be considered.
However, it is natural to ask first whether this can be
accomplished by finding the zero crossings of a single
convolution. In the following, we show that this is impossible,
using the inverse function theorem in what is essentially a
dimensionality argument. This is why it is necessary for
[Binford 1981] to require the use of 2 convolutions.

SOME  MATHEMATICS  OF 
PARAMETRIC  CONVOLUTIONS 

Let $F : \mathbb{R}^2 \to \mathbb{R}$ be a picture function, and
$K : \mathbb{R}^2 \to \mathbb{R}$ a convolution kernel, that we
also refer to as a convolution operator. The normal definition of
the convolution $K * F$ is

$$K * F(x, y) = \iint K(x - \xi,\; y - \eta)\, F(\xi, \eta)\, d\xi\, d\eta,$$

or, in vector notation,

$$K * F(\mathbf{x}) = \int_{\mathbb{R}^2} K(\mathbf{x} - \boldsymbol{\zeta})\, F(\boldsymbol{\zeta})\, dA.$$

*This work was supported in part by ARPA contracts
MDA903-80-C-0102 and N00039-82-C-0250.

We want to use a more abstract notation for this, so that we can
generalize it slightly in a transparent way.


Fig-  (P) 


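Let $g : \mathbb{R}^2 \to \mathbb{R}^2$ be an invertible map, e.g.
a rotation or translation of the plane, and let
$K : \mathbb{R}^2 \to \mathbb{R}$. To describe "doing" $g$ to $K$,
define the map
$T_g : \mathcal{F}(\mathbb{R}^2) \to \mathcal{F}(\mathbb{R}^2)$,
where $\mathcal{F}(\mathbb{R}^2)$ is a space of functions
$\mathbb{R}^2 \to \mathbb{R}$, by

$$T_g(K) = K \circ g^{-1}.$$

Observe that
$T_{g \circ h}(K) = K \circ (g \circ h)^{-1} = K \circ h^{-1} \circ g^{-1}$,
so the argument transformations "go in reverse order" from the
space transformations. Notice also the interesting fact that $T_g$
is a linear map, even if $K$ and $g$ are not. Proof:
$T_g(\alpha K + L) = (\alpha K + L) \circ g^{-1} = \alpha\,(K \circ g^{-1}) + L \circ g^{-1} = \alpha\, T_g(K) + T_g(L)$.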
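
The properties of $T_g$ stated above can be checked numerically. The following is a minimal sketch, not part of the original paper: the kernel `K`, the second function `L`, and all helper names are hypothetical choices made for illustration. It represents each map by its inverse and verifies the translation rule, the reverse-order composition rule, and linearity at one sample point.

```python
import math

def make_T(g_inv):
    """Build the operator T_g from g^{-1}: T_g(K) = K o g^{-1}."""
    def T_g(K):
        return lambda p: K(g_inv(p))
    return T_g

def translation_inv(x):
    """Inverse of the translation tau_x : p -> p + x."""
    return lambda p: (p[0] - x[0], p[1] - x[1])

def rotation_inv(theta):
    """Inverse of the rotation by theta, i.e. the rotation by -theta."""
    c, s = math.cos(theta), math.sin(theta)
    return lambda p: (c * p[0] + s * p[1], -s * p[0] + c * p[1])

def K(p):
    # Toy kernel (an assumption): odd in the first coordinate.
    return p[0] * math.exp(-(p[0] ** 2 + p[1] ** 2))

def L(p):
    # A second toy function, used only for the linearity check.
    return math.cos(p[0]) + p[1]

x, theta, p, alpha = (0.5, -1.0), 0.3, (0.2, 0.7), 2.5
tau_inv, rho_inv = translation_inv(x), rotation_inv(theta)
T_tau, T_rho = make_T(tau_inv), make_T(rho_inv)

# Translation rule: T_tau(K)(p) = K(p - x).
shifted = T_tau(K)(p)
direct = K((p[0] - x[0], p[1] - x[1]))

# Composition rule: (tau o rho)^{-1} = rho^{-1} o tau^{-1},
# so T_{tau o rho}(K) agrees with T_tau(T_rho(K)).
T_comp = make_T(lambda q: rho_inv(tau_inv(q)))
comp_lhs = T_comp(K)(p)
comp_rhs = T_tau(T_rho(K))(p)

# Linearity: T_g(alpha*K + L) = alpha*T_g(K) + T_g(L).
combo = lambda q: alpha * K(q) + L(q)
lin_lhs = T_tau(combo)(p)
lin_rhs = alpha * T_tau(K)(p) + T_tau(L)(p)
```

Storing $g^{-1}$ rather than $g$ is a design choice of this sketch: the definition of $T_g$ only ever evaluates the inverse.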

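In particular, let $G$ be the translation group of $\mathbb{R}^2$,
where $\tau_{\mathbf{x}} \in G$ is defined by

$$\tau_{\mathbf{x}} : \mathbb{R}^2 \to \mathbb{R}^2, \qquad \mathbf{p} \mapsto \mathbf{p} + \mathbf{x}.$$

$T_g(K)$ is what you get when you "move $K$ by $g$"; the right
hand side of its definition shows how to calculate the new $K$.
For a translation $\tau_{\mathbf{x}}$,
$T_{\tau_{\mathbf{x}}}(K)(\mathbf{p}) = K(\mathbf{p} - \mathbf{x})$,
which is often baffling for beginning students.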

Define the inversion operator, $\iota$, by

$$\iota : \mathbb{R}^2 \to \mathbb{R}^2, \qquad \mathbf{x} \mapsto -\mathbf{x}.$$

Note that $\iota^{-1} = \iota$, and that in $\mathbb{R}^2$,
inversion is the same as rotation by 180°.

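Using the notation for inversion and translation, the convolution
formula can be rewritten

$$K * F(\mathbf{x}) = \int K \circ \iota \circ \tau_{\mathbf{x}}^{-1} \cdot F \; dA,$$

where $\mathbf{x}$ is now a generic point of $\mathbb{R}^2$ and
$dA$ is the area measure. Note
$\iota \circ \tau_{\mathbf{x}}^{-1} = \tau_{\mathbf{x}} \circ \iota = (\tau_{\mathbf{x}} \circ \iota)^{-1}$.
So, using the $T$ notation,

$$K * F(\mathbf{x}) = \int T_{\tau_{\mathbf{x}} \circ \iota}(K) \cdot F \; dA,$$

or, abusing the notation somewhat,

$$K * F(\mathbf{x}) = \int T_{\mathbf{x}}(K) \cdot F \; dA.$$

We can make the notation more compact by using the $L^2$ inner
product $\langle \cdot, \cdot \rangle$, defined by
$\langle g, h \rangle = \int g h \, dA$:

$$K * F(\mathbf{x}) = \langle T_{\mathbf{x}}(K), F \rangle.$$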
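
As a hedged illustration of the inner-product form of the convolution, here is a discrete analogue on a small integer grid. This is a sketch under assumptions, not the paper's implementation: the integral is replaced by a finite sum, and the picture values, the two-tap difference kernel, and all function names are hypothetical.

```python
def inversion(p):
    """iota : p -> -p (its own inverse)."""
    return (-p[0], -p[1])

def translation_inv(x):
    """tau_x^{-1} : p -> p - x."""
    return lambda p: (p[0] - x[0], p[1] - x[1])

def T_x(K, x):
    """The flipped, shifted kernel K o iota o tau_x^{-1}."""
    t_inv = translation_inv(x)
    return lambda z: K(inversion(t_inv(z)))

def inner(g, h, points):
    """Discrete stand-in for the L2 inner product <g, h>."""
    return sum(g(z) * h(z) for z in points)

# Toy picture on a 5 x 5 grid and a two-tap difference kernel (assumptions).
grid = [(i, j) for i in range(5) for j in range(5)]
picture = {(i, j): i * i + 2 * j for (i, j) in grid}
kernel = {(0, 0): 1, (1, 0): -1}

def F(z):
    return picture.get(z, 0)

def K(z):
    return kernel.get(z, 0)

def convolve_at(x):
    """K * F at x, computed as <T_x(K), F>."""
    return inner(T_x(K, x), F, grid)

def convolve_direct(x):
    """K * F at x, computed from the usual sum of K(x - z) F(z)."""
    return sum(K((x[0] - z[0], x[1] - z[1])) * F(z) for z in grid)

value = convolve_at((3, 2))  # F(3,2) - F(2,2) = 13 - 8 = 5
```

Both routes give the same number because $T_{\mathbf{x}}(K)(\boldsymbol{\zeta}) = K(\mathbf{x} - \boldsymbol{\zeta})$, which is exactly the kernel argument in the usual definition.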

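We can define a rotation operator, $\rho$, by

$$\rho_\theta : \mathbb{R}^2 \to \mathbb{R}^2, \qquad r e^{i\varphi} \mapsto r e^{i(\varphi + \theta)}.$$

Using all the notations, we can include rotations by allowing $G$
to be the rigid motion (Euclidean) group of $\mathbb{R}^2$, and
considering the functions defined by

$$S(K, F, \mathbf{x}, \theta) = K_\theta * F(\mathbf{x}) = \langle T_{(\mathbf{x}, \theta)}(K), F \rangle
= \langle T_{\tau_{\mathbf{x}} \circ \iota \circ \rho_\theta}(K), F \rangle
= \int K \circ \rho_\theta^{-1} \circ \iota^{-1} \circ \tau_{\mathbf{x}}^{-1} \cdot F \; dA.$$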


THE LIMITATIONS OF ZERO-CROSSINGS


We are interested in the function obtained from
$K_\theta * F(\mathbf{x})$ by fixing $K$, $F$: this is a function
$s : \mathbb{R}^2 \times S^1 \to \mathbb{R}$, i.e. a function of
$\mathbf{x}, \theta$. I.e., we define $s$ by
$s(\mathbf{x}, \theta) = S(K, F, \mathbf{x}, \theta)$. It is the
zero crossings of $s$ which we are seeking. Let's underscore the
role of the symmetry group $G$ in the definition of $s$. The
construction we used to define $s$ actually defines a map
$s : G \to \mathbb{R}$. In fact, for any family of $K$'s defined
by some map $M \to \mathcal{F}(\mathbb{R}^2)$, where $M$ is the
indexing set for the family, we can define $s : M \to \mathbb{R}$.
So one can easily add parameters, e.g. to allow different size
operators, and this type of analysis is still applicable.

We want to show that for $s : M^3 \to \mathbb{R}$ (where $M^3$ is
a 3-dimensional manifold) the "zero crossings" cannot be an edge
locus. To do this, we will have to be more precise about what we
mean by "zero crossing," and we will consider separately the
cases when $s$ is differentiable, and only continuous. The 2
cases can be analyzed independently; the continuous case subsumes
the differentiable case, but since the differentiable case
provides better insight, we treat it first.

To begin with, we make some observations about when $s$ is
continuous or continuously differentiable. Here is Lemma 2 of
Ch. XIV §1 (p. 375) of [Lang 1969].

Lemma. Let $X$ be a measured space with positive measure $\mu$.
Let $U$ be an open subset of $\mathbb{R}^n$. Let $f$ be a function
on $X \times U$. Assume:

1) For each $y \in U$ the function $x \mapsto f(x, y)$ is in
$L^1(\mu)$.

2) For each $y \in U$, each partial $D_j f(x, y)$ (taken with
respect to the $j$-th $y$-variable) is in $L^1(\mu)$.

3) There exists a function $f_1 \in L^1(\mu)$ such that for all
$y \in U$,

$$|D_j f(x, y)| \le |f_1(x)|.$$

Let

$$\phi(y) = \int_X f(x, y)\, d\mu(x).$$

Then $D_j \phi$ exists and we have

$$D_j \phi(y) = \int_X D_j f(x, y)\, d\mu(x).$$

The lemma permits us to conclude the following.

Theorem [Lang 1969]. Let $f \in L^1$ and $\varphi \in C^r$,
$r \ge 1$, with compact support. Then $f * \varphi \in C^r$ and
$D^p(f * \varphi) = f * D^p \varphi$ for $p \le r$.

Notice that this means that no matter how badly behaved $f$ may
be, $f * \varphi$ is as differentiable as $\varphi$. In
particular, convolution with a $C^\infty$ function results in a
$C^\infty$ function. In our situation, if either the picture or
the convolution kernel is differentiable with respect to the
parameters (we may interchange the two, allowing the symmetries
to act on the picture if it suits us), then our function $s$, the
convolution, is likewise differentiable. On the other hand it may
happen that both the kernel and picture contain discontinuities,
e.g. if they have steps. In that case, integration by parts
yields the fact that the convolution, i.e. $s$, is continuous.

Definition  of  zero  crossings 

If $s$ is $C^0$ (continuous), then we will say it has a zero
crossing at $(x, y, \theta)$ if the functions
$s(\cdot, y, \theta)$ and $s(x, y, \cdot)$ have 1-dimensional zero
crossings at $x$ and $\theta$, respectively. Colloquially, this
means that the $x$ and $\theta$ functions have zero crossings. We
don't require a zero crossing in the $y$ direction, because it
may be the direction of the edge. We will say that
$f : \mathbb{R} \to \mathbb{R}$ has a 1-dimensional zero crossing
at $x$ if $f(x) = 0$, $x$ is the only zero in some neighborhood,
and $f$ has opposite signs on opposite sides of $x$ in such a
neighborhood.

If $s$ is $C^r$, $r \ge 1$, then we will say that $s$ has a zero
crossing at $(x, y, \theta)$ if $s(x, y, \theta) = 0$ and
$D_1 s(x, y, \theta) \ne 0 \ne D_3 s(x, y, \theta)$, where $D_i$
indicates the derivative with respect to the $i$-th coordinate.
Thus $(x, y, \theta)$ is a regular point of $s$, which means that
not all of its partials are 0 at that point. This implies the
$C^0$ definition of zero crossing.
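
For sampled data, the $C^0$ definition can be approximated by a sign-change test. The sketch below is an illustration under assumptions (uniformly sampled slices, exact zeros stored as `0`); the function names are hypothetical and the paper itself works with continuous functions. A crossing is reported either between two adjacent samples of strictly opposite sign, or at an isolated zero sample whose neighbours have opposite signs; tangent zeros and runs of zeros are rejected, in keeping with the definition.

```python
def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def zero_crossings_1d(samples):
    """Return index pairs (i, j) bracketing each 1-D zero crossing.

    (i, i + 1): the sign flips between adjacent nonzero samples.
    (i, i):     samples[i] is an isolated zero with opposite signs
                on its two sides.
    """
    crossings = []
    n = len(samples)
    for i in range(n - 1):
        if sign(samples[i]) * sign(samples[i + 1]) < 0:
            crossings.append((i, i + 1))
    for i in range(1, n - 1):
        if samples[i] == 0 and sign(samples[i - 1]) * sign(samples[i + 1]) < 0:
            crossings.append((i, i))
    return sorted(crossings)

def crosses_at(samples, i):
    """True if the sampled function has an isolated zero crossing at index i."""
    return (i, i) in zero_crossings_1d(samples)

def c0_zero_crossing(x_slice, theta_slice, ix, itheta):
    """Both the x slice and the theta slice must cross zero at the point."""
    return crosses_at(x_slice, ix) and crosses_at(theta_slice, itheta)
```

For example, `zero_crossings_1d([1, 0, 1])` is empty because the zero is a tangent touch, not a crossing.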

Remarks  on  the  definition  of  zero  crossings 

The picture we are keeping in mind has the edge oriented along
the y-axis. The definitions seem to single out a particular set
of coordinates asymmetrically, to keep with this picture.
However, the definitions really only require that the $x$ and
$\theta$ axes not be oriented along the edge; equivalently we
could have required that some set of coordinate axes with these
properties exist.

The condition that the 2 derivatives be nonzero when $s$ is $C^r$
is there for 2 reasons: first, so that the definition reduces to
the $C^0$ case, which is intuitively the meaning of "zero
crossing"; and secondly, to avoid degenerate cases, e.g. when the
locus of zeroes is a submanifold perpendicular to the x-axis.
Note that in this case, a zero can still be a regular point of
$s$. Conversely, even if $s$ is $C^r$, the $C^0$ condition is
weaker, since e.g. it doesn't exclude tangent crossings.

Theorem. $s : \mathbb{R}^2 \times S^1 \to \mathbb{R}$ cannot have
an isolated zero crossing in either of the above senses. (By
isolated we mean there are no other zeroes in the $x, \theta$
manifold, for fixed $y$.) That is, edges cannot be localized
simultaneously in $x$ and $\theta$ by the zero crossings of a
single ($\theta$-parametrized) convolution operator.

Proof. 

Case 1: $s$ of class $C^r$, $r \ge 1$

Since $(x, y, \theta)$ is a regular point of $s$, the implicit
function theorem applies and in some neighborhood of
$(x, y, \theta)$, $s^{-1}(0)$ is a $C^r$ submanifold of dimension
2. The conditions on the partials guarantee that the surface is
not normal to any of the $x$, $y$, or $\theta$ axes, so that for
fixed $y$, there is a curve of $(x, \theta)$ values for which
$s(x, y, \theta) = 0$, so that the zero cannot be localized in
$x$ and $\theta$ simultaneously. A more direct way to see this is
to observe that what we are seeking is a function $s$ whose zero
crossings are the locus of an edge. Regarding the edge as a
function $\gamma : \mathbb{R} \to \mathbb{R}^2$, it's obvious
that adding orientation leads to a function
$\lambda : \mathbb{R} \to \mathbb{R}^2 \times S^1$ defined by
$\lambda(t) = (\gamma(t), \theta(t))$, where $\theta(t)$ is the
orientation of the edge at $\gamma(t)$. Since the image of
$\lambda$ is 1-dimensional, we cannot hope for it to be the
inverse image of a regular value of a map to the reals, since by
the implicit function theorem, that must be a 2-dimensional
object. But by the same token, if we have instead
$s : \mathbb{R}^3 \to \mathbb{R}^2$, then one can try to find
edges by finding $s^{-1}(0)$. QED Case 1.
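
A small worked example may make the Case 1 argument concrete. The function below is a toy chosen for illustration, not one that arises from a particular kernel and picture in the paper. Take

$$s(x, y, \theta) = x - \sin\theta, \qquad D_1 s = 1 \ne 0, \qquad D_3 s = -\cos\theta \ne 0 \ \text{ for } |\theta| < \pi/2.$$

Then $s(0, y, 0) = 0$, so $(0, y, 0)$ is a zero crossing in the $C^r$ sense. For fixed $y$, however, the zero set near that point is the curve

$$\{(x, \theta) : x = \sin\theta\},$$

which passes through $(0, 0)$ and contains other zeroes in every neighborhood of it. The zero crossing is therefore not isolated in the $x, \theta$ manifold, as the theorem asserts.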

Case 2: $s$ of class $C^0$

We restrict attention to the function defined on the $x, \theta$
manifold, and show that every zero crossing is an accumulation
point of zero crossings.


Fig.  ( proof  1) 


Look at Fig. (proof 1). What it shows, schematically, is an edge
operator in the vicinity of an edge, and the result of applying
some motions to it. The positions are labelled on an arbitrary
scale. The (0,0) position is where the zero crossing is. If one
assumes there are no other zeroes in some neighborhood, the
indicated operations show that there are 2 ways to get to the
same position of the operator with opposite signs for the result,
a contradiction.


A more abstract picture of this is Fig. (proof 2), which shows a
region of the $x, \theta$ manifold near the zero crossing. In
particular, we can assume without loss of generality that moving
up (i.e. rotating) causes $s$ to become $+$ (else flip the
picture top for bottom), while moving right (translating) causes
$s$ to become $-$ (else flip right for left). The restriction of
$s$ to the line joining the 2 end points of these motions must
have a zero, by the intermediate value theorem (see, e.g.
[Rudin]). QED
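
The intermediate value argument of Case 2 can be sketched numerically. This is an illustration under assumptions, not the paper's construction: the continuous function `s` below is a toy with a zero at the origin that becomes positive when moving up and negative when moving right, and `zero_on_segment` is a hypothetical helper that bisects along the segment joining the two displaced points. Shrinking the displacement produces zeroes that accumulate at the origin.

```python
import math

def s(x, theta):
    # Toy continuous function (an assumption): zero at (0, 0),
    # positive when moving up in theta, negative when moving right in x.
    return math.sin(theta) - x

def zero_on_segment(func, a, b, iterations=60):
    """Bisect along the segment from a to b, where func(a) > 0 > func(b)."""
    if not (func(*a) > 0 > func(*b)):
        raise ValueError("need func(a) > 0 > func(b)")

    def point(t):
        return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))

    lo, hi = 0.0, 1.0  # func > 0 at parameter lo, func < 0 at parameter hi
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        value = func(*point(mid))
        if value > 0:
            lo = mid
        elif value < 0:
            hi = mid
        else:
            return point(mid)
    return point(0.5 * (lo + hi))

zeros = []
for delta in (1.0, 0.1, 0.01, 0.001):
    up = (0.0, delta)      # rotate only: s > 0
    right = (delta, 0.0)   # translate only: s < 0
    zeros.append((delta, zero_on_segment(s, up, right)))
```

Each zero found lies on a segment whose points are within distance `delta` of the origin, so no neighborhood of the original zero crossing is free of other zeroes.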


ABSTRACT

Shape-from-shading and shape-from-texture methods have the
serious drawback that they are applicable only to smooth surfaces,
while real surfaces are often rough and crumpled. To extend such
methods to real surfaces we must have a model that also applies to
rough surfaces. The fractal surface model [Pentland 83] provides a
formalism that is competent to describe such natural 3-D surfaces
and, in addition, is able to predict human perceptual judgments of
smoothness versus roughness, thus allowing the reliable
application of shape estimation techniques that assume smoothness.
This model of surface shape has been used to derive a technique
for 3-D shape estimation that treats shading and texture in a
unified manner.

I.  INTRODUCTION 

The world that surrounds us, except for man-made environments,
is typically formed of complex, rough, and jumbled surfaces.
Current representational schemes, in contrast, employ smooth,
analytical primitives (e.g., generalized cylinders or splines) to
describe three-dimensional shapes. While such smooth-surfaced
representations function well in man-made, carpentered
environments, they break down when we attempt to describe the
crenulated, crumpled surfaces typical of natural objects. This
problem is most acute when we attempt to develop techniques for
recovering 3-D shape, for how can we expect to extract 3-D
information in a world populated by rough, crumpled surfaces when
all of our models refer to smooth surfaces only? The lack of a
3-D model for such naturally occurring surfaces has generally
restricted image-understanding efforts to a world populated
exclusively by smooth objects, a sort of "Play-Doh" world [1]
that is not much more general than the blocks world.

Standard shape-from-shading [2,3] methods, for instance, all
employ the heuristic of "smoothness" to relate neighboring points
on a surface. Shape-from-texture [4,5] methods make similar
assumptions: their models are concerned either with markings on a
smooth surface, or discard three-dimensional notions entirely and
deal only with ad hoc measurements of the image. Before we can
reliably employ such techniques in the natural world, we must be
able to determine which surfaces are smooth and which are not, or
else generalize our techniques to include the rough, crumpled
surfaces typically found in nature.

To accomplish this, we must have recourse to a 3-D model
competent to describe both crumpled surfaces and smooth ones.
Ideally, we would like a model that captures the intuition that
smooth surfaces are the limiting case of rough, textured ones,
for such a model might allow us to formulate a unified framework
for obtaining shape from both shading (smooth surfaces) and
texture (rough surfaces, markings on smooth surfaces).


The research reported herein was supported by National Science
Foundation Grant No. DCR-83-12766 and the Defense Advanced
Research Projects Agency under Contract No. MDA 903-83-C-0027
(monitored by the U.S. Army Engineer Topographic Laboratory).


Figure 1. Surfaces of Increasing Fractal Dimension.


The fractal model of surface shape [6,7] appears to possess the
required properties. Evidence for this comes from recently
conducted surveys of natural imagery [6,8]. These surveys found
that the fractal model of imaged 3-D surfaces furnishes an
accurate description of most textured and shaded image regions.
Perhaps even more convincing, however, is the fact that fractals
look like natural surfaces [9,10,11]. This is important
information for workers in computer vision, because the natural
appearance of fractals is strong evidence that they capture all
of the perceptually relevant shape structure of natural surfaces.

II.  FRACTALS  AND  THE  FRACTAL  MODEL 

During the last twenty years, Benoit B. Mandelbrot has developed
and popularized a relatively novel class of mathematical
functions known as fractals [9,10]. Fractals are found
extensively in nature [9,10,12]. Mandelbrot, for instance, shows
that fractal surfaces are produced by many basic physical
processes. The defining characteristic of a fractal is that it
has a fractional dimension, from which we get the word "fractal."
One general characterization of fractals is that they are the end
result of physical processes that modify shape through local
action. After innumerable repetitions, such processes will
typically produce a fractal surface shape.

The fractal dimension of a surface corresponds quite closely to
our intuitive notion of roughness. Thus, if we were to generate a
series of scenes with the same 3-D relief but with increasing
fractal dimension $D$, we would obtain a sequence of surfaces
with linearly increasing perceptual roughness, as is shown in
Figure 1: (a) shows a flat plane ($D \approx 2$), (b) rolling
countryside ($D \approx 2.1$), (c) an old, worn mountain range
($D \approx 2.3$), (d) a young, rugged mountain range
($D \approx 2.5$), and, finally, (e) a stalagmite-covered plane
($D \approx 2.8$).

EXPERIMENTAL NOTE: Ten naive subjects (natural-language
researchers) were shown sets of fifteen 1-D curves and 2-D
surfaces with varying fractal dimension but constant range (e.g.,
see Figure 1), and asked to estimate roughness on a scale of one
(smoothest) to ten (roughest). The mean of the subjects'
estimates of roughness had a nearly perfect 0.98 correlation
(i.e., 96% of the variance was accounted for) (p < 0.001) with
the curve's or surface's fractal dimension. The fractal measure
of perceptual roughness is therefore almost twice as accurate as
any other reported to date, e.g., [13].


Fractal Brownian Functions. Virtually all fractals encountered in
physical models have two additional properties: (1) each segment
is statistically similar to all others; (2) they are
statistically invariant over wide transformations of scale. The
path of a particle exhibiting Brownian motion is the canonical
example of this type of fractal; the discussion that follows,
therefore, will be devoted exclusively to fractal Brownian
functions, which are a mathematical generalization of Brownian
motion.


A random function $I(x)$ is a fractal Brownian function if for
all $x$ and $\Delta x$

$$\Pr\left( \frac{I(x + \Delta x) - I(x)}{\|\Delta x\|^{H}} < y \right) = F(y) \qquad (1)$$

where $F(y)$ is a cumulative distribution function [7]. Note that
$x$ and $I(x)$ can be interpreted as vector quantities, thus
providing an extension to two or more topological dimensions. If
$I(x)$ is scalar, the fractal dimension $D$ of the graph
described by $I(x)$ is $D = 2 - H$. If $H = 1/2$ and $F(y)$ comes
from a zero-mean Gaussian with unit variance, then $I(x)$ is the
classical Brownian function.

The fractal dimension of these functions can be measured either
directly from $I(x)$ by use of Equation (1),* or from $I(x)$'s
Fourier power spectrum** $P(f)$, as the spectral density of a
fractal Brownian function is proportional† to $f^{-2H-1}$.
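
The spectral route to $H$ can be sketched as a straight-line fit in log-log coordinates, using the relation $\log P(f) = -(2H + 1)\log f + k$. The code below is a minimal sketch under assumptions: the power spectrum is taken as already computed (for example by an FFT, which is not shown), and the function names `fit_line` and `hurst_from_spectrum` are hypothetical. The sample data is an exact synthetic power law, not measured imagery.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hurst_from_spectrum(freqs, power):
    """Estimate H from a power spectrum assumed proportional to f^(-2H-1)."""
    slope, _ = fit_line([math.log(f) for f in freqs],
                        [math.log(p) for p in power])
    # slope = -(2H + 1), so H = -(slope + 1) / 2
    return -(slope + 1.0) / 2.0

# Synthetic spectrum with a known exponent (an assumption for the demo).
true_H = 0.7
freqs = list(range(1, 33))
power = [3.0 * f ** (-(2 * true_H + 1)) for f in freqs]

H_est = hurst_from_spectrum(freqs, power)
D_est = 2.0 - H_est  # fractal dimension of a scalar profile, D = 2 - H
```

On real spectra the fit would normally be restricted to the frequency band over which the power law is expected to hold; that choice is left open here.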


Properties of Fractal Brownian Functions. Fractal functions must
be stable over common transformations if they are to be useful as
a descriptive tool. Previous reports [6,7] have shown that the
fractal dimension of a surface is invariant with respect to
linear transformations of the data and to transformations of
scale. Estimates of fractal dimension, therefore, may be expected
to remain stable over smooth, monotonic transformations of the
image data and over changes of scale.


A. The Fractal Surface Model And The Imaging Process

Before we can use a fractal model of natural surfaces to help us
understand images, we must determine how the imaging process maps
a fractal surface shape into an image intensity surface. The
first step is to define our terms carefully.

DEFINITION: A fractal Brownian surface is a continuous function
that obeys the statistical description given by Equation (1),
with $x$ as a two-dimensional vector at all scales (i.e., values
of $\Delta x$) between some smallest ($\Delta x_{min}$) and
largest ($\Delta x_{max}$) scales.

*We rewrite Equation (1) to obtain the following description of
the manner in which the second-order statistics of the image
change with scale:
$E(|\Delta I_{\Delta x}|)\, \|\Delta x\|^{-H} = E(|\Delta I_{\Delta x = 1}|)$,
where $E(|\Delta I_{\Delta x}|)$ is the expected value of the
change in intensity over distance $\Delta x$. To estimate $H$,
and thus $D$, we calculate the quantities
$E(|\Delta I_{\Delta x}|)$ for various $\Delta x$, and use a
least-squares regression on the log of our rewritten Equation (1).

**That is, since the power spectrum $P(f)$ is proportional to
$f^{-2H-1}$, we may use a linear regression on the log of the
observed power spectrum as a function of $f$ (e.g., a regression
using $\log(P(f)) = -(2H + 1)\log(f) + k$ for various values of
$f$) to determine the power $H$ and thus the fractal dimension.

†Discussion of the rather technical proof of this proportionality
may be found in Mandelbrot [10].
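
The increment-based estimate of $H$ can be sketched as follows: compute the mean absolute intensity change $E(|\Delta I_{\Delta x}|)$ at several lags $\Delta x$, then fit a line to $\log E(|\Delta I_{\Delta x}|)$ against $\log \Delta x$, whose slope is $H$. This is a minimal 1-D sketch under assumptions, not the author's implementation: the function names are hypothetical, the lags are an arbitrary choice, and the test signals (a linear ramp and a seeded Gaussian random walk) are synthetic.

```python
import math
import random

def mean_abs_increment(profile, lag):
    """Sample estimate of E(|I(x + lag) - I(x)|) over a 1-D profile."""
    diffs = [abs(profile[i + lag] - profile[i])
             for i in range(len(profile) - lag)]
    return sum(diffs) / len(diffs)

def estimate_H(profile, lags):
    """Slope of log E(|dI|) versus log(lag), by least squares."""
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(mean_abs_increment(profile, lag)) for lag in lags]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

lags = [1, 2, 4, 8, 16]

# A linear ramp has |dI| proportional to the lag, so H = 1 (smooth, D = 1).
ramp = [0.5 * i for i in range(200)]
H_ramp = estimate_H(ramp, lags)

# A Gaussian random walk approximates classical Brownian motion, H = 1/2.
random.seed(0)
walk, level = [], 0.0
for _ in range(20000):
    level += random.gauss(0.0, 1.0)
    walk.append(level)
H_walk = estimate_H(walk, lags)
D_walk = 2.0 - H_walk  # fractal dimension of the scalar profile
```

The random-walk estimate is only approximately 0.5, since it comes from a finite sample; the ramp case is exact up to floating-point error.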

DEFINITION: A spatially isotropic fractal Brownian surface is a
surface in which the components of the surface normal
$\mathbf{N} = (N_x, N_y, N_z)$ are themselves fractal Brownian
surfaces of identical fractal dimension.

Our previous papers [6,7] have presented evidence showing that
most natural surfaces are spatially isotropic fractals, with
$\Delta x_{min}$ and $\Delta x_{max}$ being the size of the
projected pixel and the size of the examined surface patch,
respectively. This finding has since been confirmed by others
[8]. Furthermore, it is interesting to note that practical
fractal generation techniques, such as those used in computer
graphics, have had to constrain the fractal-generating function
to produce spatially isotropic fractal Brownian surfaces in order
to obtain realistic imagery [11]. Thus, it appears that many real
3-D surfaces are spatially isotropic fractals, at least over a
wide range of scales.*

With these definitions in hand, we can now address the problem of
how 3-D fractal surfaces appear in the 2-D image.

Proposition 1. A 3-D surface with a spatially isotropic fractal
Brownian shape produces an image whose intensity surface is
fractal Brownian and whose fractal dimension is identical to that
of the components of the surface normal, given a Lambertian
surface reflectance function and constant illumination and
albedo.

This proposition (proved in [7]) demonstrates that the fractal
dimension of the surface normal dictates the fractal dimension of the
image intensity surface and, of course, the dimension of the physical
surface. Simulation of the imaging process with a variety of imaging
geometries and reflectance functions indicates that this proposition
will hold quite generally: the "roughness" of the surface seems to
dictate the "roughness" of the image. If we know that the surface is
homogeneous,** we can estimate the fractal dimension of the surface
by measuring the fractal dimension of the image data. What we have
developed, then, is a method for inferring a basic property of the 3-D
surface, i.e., its fractal dimension, from the image data. The fact
that fractal dimension has also been shown to correspond closely to our
intuitive notion of roughness confirms the fundamental importance of
the measurement.
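
As a rough illustration of how such a measurement could be made, the sketch below estimates the scaling exponent H of a one-dimensional intensity profile from the way the mean absolute difference E(|I(x + d) - I(x)|) grows with the separation d; for a fractal Brownian function this quantity scales as d^H. This is only a sketch under stated assumptions: the function names, the choice of lags and the log-log least-squares fit are illustrative and are not taken from the text, and the relation D = 2 - H for a profile (D = 3 - H for a surface) is the standard one for fractal Brownian functions.

```python
import math
import random


def brownian_profile(n, seed=0):
    """Synthetic test profile: cumulative sum of Gaussian steps (H = 0.5)."""
    rng = random.Random(seed)
    value, profile = 0.0, []
    for _ in range(n):
        value += rng.gauss(0.0, 1.0)
        profile.append(value)
    return profile


def estimate_h(profile, lags=(1, 2, 4, 8, 16, 32)):
    """Slope of log E|I(x+d) - I(x)| versus log d, fitted by least squares."""
    xs, ys = [], []
    for d in lags:
        diffs = [abs(profile[i + d] - profile[i]) for i in range(len(profile) - d)]
        xs.append(math.log(d))
        ys.append(math.log(sum(diffs) / len(diffs)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def profile_dimension(h):
    """Fractal dimension of a 1-D profile with scaling exponent h (assumed D = 2 - H)."""
    return 2.0 - h


if __name__ == "__main__":
    h = estimate_h(brownian_profile(20000, seed=1))
    print(f"H ~ {h:.2f}, D ~ {profile_dimension(h):.2f}")
```

Running the estimator separately along the x and y directions of an image region gives the two values whose equality or inequality is used later in the text as evidence for or against isotropy.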

EXPERIMENTAL NOTE: Fifteen naive subjects (mostly language
researchers) were shown digitized images of eight natural
textured surfaces drawn from Brodatz [14]. They were asked, "If
you were to draw your finger horizontally along the surface
pictured here, how rough or smooth would the surface feel?", i.e.,
they were asked to estimate the 3-D roughness/smoothness of the
viewed surfaces. A scale of one (smoothest) to ten (roughest) was
used to indicate 3-D roughness/smoothness. The mean of the
subjects' estimates of 3-D roughness had an excellent 0.91
correlation (i.e., 83% of the variance was accounted for) (p < 0.001) with
roughnesses predicted by use of the image's 2-D fractal dimension
and Proposition 1. This result supports the general validity of
Proposition 1.

B. Identification of Shading Versus Texture

Fractal functions with H ≈ 0 are planar except for random variations
described by the function F(y) in Equation (1). If the variance
of F(y) is small, people judge these surfaces to be "smooth"; thus,
the fractal model with small values of H is appropriate for modeling
smooth, shaded regions of the image. If the surface has significant local
fluctuations, i.e., if the variance of F(y) is large, the surface is seen
as being smooth but textured, in the sense that markings or some other
2-D effect is modifying the appearance of the underlying smooth surface.
In contrast, fractals with H > 0 are not perceived as smooth, but rather
as being rough or three-dimensionally textured.

*This does not mean that the surfaces are completely isotropic, merely
that their fractal (metric) properties are isotropic.

**Perhaps determined by the use of imaged color.

The fractal model can therefore encompass shading, 2-D texture,
and 3-D texture, with shading as a limiting case in the spectrum of
3-D texture granularity. The fractal model thus allows us to make a
reasonable, rigorous and perceptually plausible definition of the
categories "textured" versus "shaded," "rough" versus "smooth," in terms
that can be measured by using the image data.

The ability to differentiate between "smooth" and "rough" surfaces
is critical to the performance of current shape-from-shading and
shape-from-texture techniques. For surfaces that, from a perceptual
standpoint, are smooth (H ≈ 0) and not 2-D textured (Var(F(y))
small), it seems appropriate to apply shading techniques.* For surfaces
that have 2-D texture it is more appropriate to apply available
texture measures. Thus, use of the fractal surface model to infer
qualitative 3-D shape (namely, smoothness/roughness) has the potential
of significantly improving the utility of many other machine vision
methods.

III. Shape Estimates From Texture And Shading

The fractal surface model allows us to do quite a bit better than
simply identifying smooth versus textured surfaces and applying
previously discovered techniques. Because we have a unified model of
shading, 2-D texture and 3-D texture, we can derive a shape estimation
procedure that treats shaded, two-dimensionally textured, and
three-dimensionally textured surfaces in a single, unified manner.

A.  Development  of  a  Robust  Texture  Measure 

Let us assume that (1) albedo and illumination are constant in
the neighborhood being examined, and (2) the surface reflects light
isotropically (Lambert's law). We are then led to this simple model of
image formation:

$$I = \rho\lambda(\mathbf{N}\cdot\mathbf{L}) \tag{2}$$

where ρ is surface albedo, λ is incident flux, N is the (three-dimensional)
unit surface normal, and L is a (three-dimensional) unit vector pointing
toward the illuminant. The first assumption means that the model
holds only within homogeneous regions of the image, e.g., regions
without self-shadowing. The second assumption is an idealization of
matte, diffusely reflecting surfaces and of shiny surfaces in regions that
are distant from highlights and specularities [3].

In Equation (2), image intensity is dependent upon the surface
normal, as all other variables have been assumed constant. Similarly,
the second derivative of image intensity is dependent upon the second
derivative of the surface normal, i.e.,

$$d^2 I = \rho\lambda(d^2\mathbf{N}\cdot\mathbf{L}) \tag{3}$$

(Notation: we will write $d^2 I$ and $d^2\mathbf{N}$ to indicate the
second-derivative quantities computed along some image direction (dx, dy),
this direction to be indicated implicitly by the context.)

The fractal model, taken together with previous results [16], implies
that on average $d^2\mathbf{N}$ is parallel to $\mathbf{N}$. Consequently,
if we divide Equation (3) by Equation (2) we will on average obtain the
following relationship:

$$E\left(\left|\frac{d^2 I}{I}\right|\right) = E\left(\left|\frac{d^2\mathbf{N}\cdot\mathbf{L}}{\mathbf{N}\cdot\mathbf{L}}\right|\right) = E\left(\left|d^2\mathbf{N}\right|\right) \tag{4}$$

where $E(x)$ denotes the expected value (mean) of x; the last equality
holds because N is a unit vector. That is, we can estimate how crumpled
and textured the surface is (i.e., the average magnitude of the surface
normal's second derivative) by observing $E(|d^2 I/I|)$.

*Indeed, it is only in these cases that measurement noise can be reduced
(by averaging) to the levels required by shape-from-shading techniques
without simultaneously destroying evidence of surface shape.

Equation (4) provides us with a measure of 3-D texture that is (on
average and under the above assumptions) independent of illuminant
effects. This measure is affected by foreshortening, however, which acts
to increase the apparent frequency of variations in the surface, e.g., the
average magnitude of $d^2\mathbf{N}$. We can, therefore, obtain an estimate
of surface orientation by employing the approach adopted in other texture
work [5]: if we assume that the 3-D surface texture is isotropic, the
surface tilt* is simply the direction of maximum $E(|d^2 I/I|)$ and the
surface slant** can be derived from the ratio between
$\max_\theta E(|d^2 I/I|)$ and $\min_\theta E(|d^2 I/I|)$, where θ designates
the (implicit) direction along which the texture measure is evaluated.
Specifically, the surface slant is the arc cosine of $z_N$, the z-component
of the surface normal, and for isotropic textures $z_N$ is equal to the
square root of this ratio (taken as minimum over maximum, so that
$z_N \le 1$). The square-root factor is necessitated by the use of
second-derivative terms.
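
The tilt and slant rule just described can be sketched in a few lines. This is a minimal illustration, not the author's implementation: it assumes the directional texture measure E(|d²I/I|) has already been computed for a handful of image directions, and the function name `tilt_and_slant` and the sample values are hypothetical. The ratio is taken as minimum over maximum so that the z-component stays at or below one.

```python
import math


def tilt_and_slant(measure_by_angle):
    """Estimate tilt and slant from directional texture measures.

    measure_by_angle maps an image direction (radians) to the value of
    E(|d2I/I|) measured along that direction.
    """
    # Tilt: the direction along which the measure is largest.
    tilt = max(measure_by_angle, key=measure_by_angle.get)
    m_max = measure_by_angle[tilt]
    m_min = min(measure_by_angle.values())
    # z-component of the normal: square root of the min/max ratio.
    z_n = math.sqrt(m_min / m_max)
    slant = math.acos(z_n)
    return tilt, slant, z_n


if __name__ == "__main__":
    # Made-up measurements: strongest variation along the vertical direction.
    samples = {0.0: 1.0, math.pi / 4: 2.5, math.pi / 2: 4.0, 3 * math.pi / 4: 2.5}
    tilt, slant, z_n = tilt_and_slant(samples)
    print(math.degrees(tilt), math.degrees(slant), z_n)
```

Because both the maximum and the minimum are taken from the same local neighbourhood, this point-by-point form is the one that is sensitive to anisotropic textures, which motivates the regional estimate developed next.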

One of the advantages of this shape-from-texture technique is that
not only can it be applied to the 2-D textures addressed by other
researchers [1,5] (by simply using this texture frequency measure in
place of theirs†), but it can also be applied to surfaces that are
three-dimensionally textured, and in exactly the same manner. This
texture measure, therefore, allows us to extend existing
shape-from-texture methods beyond 2-D textures to encompass 3-D textures
as well.

B. Development of a Robust Shape Estimator

These shape-from-texture techniques are critically dependent
upon the assumption of isotropy: when the textures are anisotropic
(stretched), the error is substantial. Estimates of the fractal dimension
of the viewed surface [6,7], by virtue of their independence with respect
to multiplicative transforms, offer a partial solution to this problem.
Because foreshortening is a multiplicative effect, the computed fractal
dimension is not affected by the orientation of the surface.‡ Thus,
if we measure the fractal dimension of an isotropically textured surface
along the x and y directions, the measurements must be identical.
If, however, we find that they are unequal, we then have prima facie
evidence of anisotropy in the surface.

This method of identifying anisotropic textures is most effective
when each point on the surface has the same direction and magnitude
of anisotropy, for in these cases we can accurately discriminate changes
in fractal dimension between the x and y directions. When the surface
texture is variable, however, this indicator of anisotropy becomes less
useful. Thus, local variation in the surface texture remains a major
source of error in our estimation techniques; it is therefore important
to develop a method of estimating surface orientation that is robust
with respect to local variation in the surface texture.

*The image-plane component of the surface normal, i.e., the direction
the surface normal would face if projected onto the image plane.

**The depth component of the surface normal.

†This measure includes edge information, i.e., the frequency of
Marr-Hildreth zero-crossings as we move in a given direction appears to be
proportional to $E(|d^2 I/I|)$ along that direction; consider that
Marr-Hildreth zero-crossings are also zero-crossings of $d^2 I/I$.

‡At least not until self-occlusion effects have become dominant in the
appearance of the surface.

Figure 2. Variation in Local Texture (a) Compared with No Variation (b).


Such robustness can be obtained by applying regional, rather than
purely local, constraints. Natural textures are often "homogeneous"
over substantial regions of the image, although there may be significant
local variation within the texture, because the processes that act to
create a texture typically affect regions rather than points on a surface.
This fact is the basis for interest in texture segmentation techniques.
Current shape-from-texture techniques do not make use of the regional
nature of textures, relying instead on point-by-point estimates. By
capitalizing on the regional nature of textures we can derive a
substantial additional constraint on our shape estimation procedure.

Let us assume that we are viewing a textured planar surface whose
orientation is a 30° slant and a vertical tilt. Let us further suppose
that the surface texture varies randomly from being isotropic to being
anisotropic (stretched) up to an aspect ratio of 3:1, with the direction
of this anisotropy also varying randomly. Such a surface, covered with
small crosses, is shown in Figure 2(a); for comparison, the same surface,
minus anisotropies, is shown in Figure 2(b).

If we apply standard shape estimation techniques, i.e., estimating
the amount of foreshortening (and thus surface orientation) by the
ratio of some texture measure along the (apparently) unforeshortened
and (apparently) maximally foreshortened directions, our estimates
of the foreshortening magnitude will vary widely, with a mean error of
65% and an rms error of 81%. If, however, we estimate the value a
of the unforeshortened texture measure by examining the entire region,
and then compare this regional estimate to the texture measure along
the (apparently) maximally foreshortened direction, then our mean
error is reduced to 40% and the rms error to 49%.

By combining this notion of regional estimation with the texture
measure developed above, i.e., $E(|d^2 I/I|)$, we can construct the
following shape-from-texture algorithm that is able to deal with both smooth
two-dimensionally textured surfaces and rough, three-dimensionally
textured surfaces, and that is robust with respect to local variations
in the surface texture.


C. A Shape Estimation Algorithm

We may construct a rather elegant and efficient shape estimation
algorithm based on the notion of regional estimation and on the texture
measure introduced above by employing the fact that

$$\nabla^2 I = d_u^2 I + d_v^2 I \tag{5}$$

for any orthogonal u, v, where the subscripts denote the direction along
which the second derivative is taken. This identity will allow us to
estimate the surface slant immediately rather than having to search all
orientations for the directions along which we obtain the maximum and
minimum values of $E(|d^2 I/I|)$.

Let us assume that we have already determined
$a = \min_\theta E(|d^2 I/I|)$, which is the regional estimate of
unforeshortened $E(|d^2\mathbf{N}|)$. When the estimate of a is exact,
Equation (5) gives us the maximum value of $E(|d^2 I/I|)$, as the
directions of maximum and minimum $E(|d^2 I/I|)$ are orthogonal.

We may therefore estimate $z_N$, the z component of the surface
normal, by

$$z_N = \sqrt{\frac{a}{A - a}}$$

where $A = E(|\nabla^2 I/I|)$ and a is the regional estimate of the
unforeshortened value of $E(|d^2 I/I|)$. The constant a can be estimated
either by the median of the local (apparently) unforeshortened texture
measure values, or by use of the constraint that $0 < z_N < 1$ within the
region. The direction of surface tilt can then be estimated by the gradient
of the resulting slant field, e.g., the local gradient of the $z_N$ values,
or (as in other methods) by examining each image direction to find the
one with the largest value of the texture frequency measure. In actual
practice we have found that the gradient method is more stable.

D. A Unified Treatment of Shading and Texture

The fractal surface model captures the intuitive notion that, if
we examine a series of surfaces with successively less three-dimensional
texture, eventually the surfaces will appear shaded rather than
textured. Because the shape-from-texture technique developed here was
built on the fractal model, we might expect that it too would degrade
gracefully into a shape-from-shading method. This is in fact the case:
this shape-from-texture technique is identical to the local
shape-from-shading technique previously developed by the author [15]. That
is, we have developed a shape-from-x technique that applies equally to 2-D
texture, 3-D texture and shading.

As an example of the application of this shape-from-texture-and-shading
technique,* Figure 3 shows (a) the digitized image of
Tuckerman's Ravine (a skiing region on Mt. Washington in New
Hampshire), and (b) a relief map giving a side view of the estimated
surface shape, obtained by integrating the slant and tilt estimates.**

Figure 3. Tuckerman's Ravine.

*This example was originally reported in Pentland [15] as the output
of a local shape-from-shading technique followed by averaging and
integration. This algorithm is identical to the shape-from-texture
technique described here; in fact, investigation of the shape-from-texture
properties of this method was motivated by the consternation caused
by this successful application of a shading technique to a textured
surface.

**This relief map may be compared directly with a topographic map of
the area; when we compare the estimated shape with the actual shape,
we find that the roll-off at the top of Figure 3(b) and the steepness of
the estimated surface are correct for this surface; the slope of this area
of the ravine averages 60°.
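
The regional slant estimate of Section III-C can be illustrated with a short sketch. It rests on assumptions that should be read as such: writing A for the local value of E(|∇²I/I|) and a for the regional unforeshortened value of E(|d²I/I|), the maximum and minimum directional measures are taken to be a/z_N² and a, and to sum to A, which gives z_N = sqrt(a / (A - a)). The function names, the clamping of z_N to at most one, and the sample numbers are hypothetical.

```python
import math
from statistics import median


def regional_a(local_min_measures):
    """Regional estimate of the unforeshortened measure: median of local minima."""
    return median(local_min_measures)


def slant_field(laplacian_measures, a):
    """Slant (degrees) at each point from A = E(|lap I / I|) and the regional a."""
    slants = []
    for big_a in laplacian_measures:
        if big_a <= a:
            # No measurable foreshortening: treat the patch as frontal.
            z_n = 1.0
        else:
            # Assumed estimator z_N = sqrt(a / (A - a)), clamped to at most 1.
            z_n = math.sqrt(min(1.0, a / (big_a - a)))
        slants.append(math.degrees(math.acos(z_n)))
    return slants


if __name__ == "__main__":
    a = regional_a([0.9, 1.0, 1.2])  # made-up local minima
    print(slant_field([2.0, 3.0, 5.0], a))
```

Using one value of a for the whole region is what makes the estimate tolerant of local anisotropy: only the Laplacian-based measure A is taken point by point.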


[15] Pentland, A. P. (1984), "Local Shading Analysis," IEEE Transactions
on Pattern Analysis and Machine Intelligence, March 1984, pp.
170-187.


IV.  Summary 

Shape-from-shading and texture methods have had the serious
drawback that they are applicable only to smooth surfaces, while
real surfaces are often rough and crumpled. We have extended these
methods to real surfaces using the fractal surface model [6,7]. The
fractal model's ability to distinguish successfully between perceptually
"smooth" and perceptually "rough" surfaces allows reliable application
of shape estimation techniques that assume smoothness. Furthermore,
we have used the fractal surface model to construct a method of
estimating 3-D shape that treats shading and texture in a unified manner.

The shape algorithm produces estimates of the surface orientation.
For display purposes, these estimates were integrated to produce a relief
map of the surface.

ABSTRACT

Matching feature points is a reliable way of measuring
image disparities that arise in stereo pairs or in images of
time-varying scenes. To do this effectively we require a
means of isolating the feature points. Here we describe a
simple scheme to detect such interesting points in images.
A correspondence algorithm then assigns matches based
on local consensus. Experiments show this scheme to be
robust and widely applicable to photographic images of
natural scenes.


1.0  Introduction 

Image disparity is the displacement of the image of a
world point due to a shift in the camera position.
Measurement of disparities is used for computing depth from
stereoscopic image pairs or for analyzing motion from
sequences of images of a dynamic scene. In the case of
motion, given a sensor with high temporal resolution and
good spatial acuity, one can compute instantaneous
motion disparity using spatio-temporal intensity variation
and assuming smooth motion and illumination constancy.
Such techniques are, however, largely untested in images
of natural scenes.

Computation  of  disparities  by  matching  has  had 
some  measure  of  success  in  the  case  of  both  stereo  and 
motion.  Two  problems  with  this  kind  of  technique  are 
expensive  computation  (for  correlation  type  of  matching) 
and  large  computation  time  (for  matching  with  iterative 
improvement  or  relaxation).  The  first  problem  leads  to 
hardware  complexity  and  the  second  detracts  from  real 
time  applications. 

The approach outlined here is an attempt to combine
the benefits of both correlation-type matching and
relaxation-type techniques. Our algorithm does not
correlate image patches, but first isolates interest points in
the images by simple linear convolutions, which can be
implemented in parallel. Then interest points are
matched to obtain disparities. The matching algorithm
uses heuristics that are compatible with psychophysical
observations of the visual mechanisms in man and
animals (e.g., Ramachandran & Anstis 1983).

2.0  Computing  Interest  Points 

An interest point is a point in the image (actually a
small neighbourhood) that has properties that distinguish
it from its neighbouring points. The properties in question
may be simple, like gray levels, or sophisticated ones
indicative of the local topography of the imaged surface.
Previous approaches to finding interest points are
exemplified in the work of Moravec (1977), Kitchen &
Rosenfeld (1980) and Nagel (1983). The above approaches
utilize operators that are nonlinear and sometimes require
computation of higher order spatial derivatives of the
image function. Another method is a locationwise
topological classification of the image function. However,
topological analysis is computationally expensive.
Furthermore, this method has to deal with the Hessian of the
image function at every point. Hence, it is also inherently
unstable in the presence of noise. A crucial observation
(Brown 1980) is that interesting patterns in the image
can be thought to have a sharply peaked autocorrelation
function. This observation gives rise to a practical interest
operator which selects image locations whose autocorrelation
decays sharply with increasing eccentricity. This
could be implemented with great simplicity if we could
design matching templates with the above property.
However, no constructive algorithm exists for this type
of template. In any case, sharply diminishing autocorrelation
is a desirable property for operator templates, and
it is useful to bear this fact in mind.

The location of interest points is the first step in
tackling the so-called correspondence problem. Hence the
operator must satisfy three requirements:

1. Selected points must be sparse.

2. Contours must be suppressed.

3. Interest points should be stable across frames.

The orthogonal decomposition technique described
subsequently is an attempt to satisfy the above requirements.




The method described here is computationally
simpler than the prevalent techniques for locating
interesting points in images. The image function is decomposed
into a weighted sum of basis functions. The central idea
is that if the basis is chosen with care, then the
distribution of the respective weights indicates the nature of
the image function directly. A similar approach can be
seen in the literature in other image processing contexts
[Frei & Chen 1977, Hueckel 1971]. A pleasing aspect of
this design is that there is neurophysiological evidence to
support our approach [Hubel & Wiesel 1959].

2.1  Preliminaries 

The image $I(x, y, t)$ is a three dimensional function.
However, we concern ourselves with a time slice of this
function at time $t = t_0$, thus obtaining a two dimensional
function

$$I(x, y) = I(x, y, t_0)$$

An image vector at a location (x, y) is formed by
concatenating the rows of the following 3x3 image patch:

$$\begin{bmatrix} I(x-1, y-1) & I(x, y-1) & I(x+1, y-1) \\ I(x-1, y) & I(x, y) & I(x+1, y) \\ I(x-1, y+1) & I(x, y+1) & I(x+1, y+1) \end{bmatrix}$$

The image vector r belongs to a 9 dimensional
vector space defined over the real field,

$$\mathbf{r} = [I_1, I_2, I_3, \ldots, I_9]$$

where $I_1$ is $I(x-1, y-1)$, $I_2$ is $I(x, y-1)$ and so on.
Alternatively,

$$\mathbf{r} = \sum_{k=1}^{9} I_k \mathbf{e}_k$$

where $\mathbf{e}_k$ is the kth column of the 9x9 identity matrix.

The image vector as defined previously is represented
with respect to the basis $\{\mathbf{e}_k\}$. The components, therefore,
by themselves do not convey any information regarding
the local topography of the image.

When we define the image in this manner, the
operation of convolving the image with a given point
spread function or correlating with a particular feature
template can be expressed with respect to the vector inner
product. Thus convolution becomes

$$\mathbf{g}\cdot\mathbf{r} = \sum_{k} g_k I_k$$

where g is the vector representation of the point spread
function and r is the image vector at a point.
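
To make the vector view concrete, the sketch below builds the 9-element image vector from a 3x3 patch and computes its inner products with an orthonormal set of nine masks, which gives the patch's coordinates in that feature basis. The paper's own nine masks are not reproduced here; the Frei-Chen basis is used purely as a stand-in because it is a well-known orthogonal 3x3 basis of the same kind. All function names and the `image[y][x]` indexing convention are assumptions made for illustration.

```python
import math

S = math.sqrt(2.0)

# Frei-Chen masks in row-major order, before normalisation (stand-in basis).
RAW_MASKS = [
    [1, S, 1, 0, 0, 0, -1, -S, -1],    # edge
    [1, 0, -1, S, 0, -S, 1, 0, -1],    # edge
    [0, -1, S, 1, 0, -1, -S, 1, 0],    # ripple
    [S, -1, 0, -1, 0, 1, 0, 1, -S],    # ripple
    [0, 1, 0, -1, 0, -1, 0, 1, 0],     # line
    [-1, 0, 1, 0, 0, 0, 1, 0, -1],     # line
    [1, -2, 1, -2, 4, -2, 1, -2, 1],   # Laplacian
    [-2, 1, -2, 1, 4, 1, -2, 1, -2],   # Laplacian
    [1, 1, 1, 1, 1, 1, 1, 1, 1],       # average (dc)
]


def dot(a, b):
    """Vector inner product."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


BASIS = [unit(m) for m in RAW_MASKS]


def image_vector(image, x, y):
    """Concatenate the rows of the 3x3 patch centred on (x, y)."""
    return [image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def decompose(r):
    """Coefficients of r in the orthonormal basis: v_k = r . f_k."""
    return [dot(r, f) for f in BASIS]


def reconstruct(v):
    """Rebuild the image vector as the weighted sum of basis vectors."""
    return [sum(v[k] * BASIS[k][i] for k in range(9)) for i in range(9)]


if __name__ == "__main__":
    img = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
    print(decompose(image_vector(img, 1, 1)))
```

Because the masks are orthonormal, no linear system has to be solved: each coefficient is a single inner product, which is what makes the scheme easy to run as nine parallel convolutions.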

With the above interpretation in mind we freely
interchange the terms function and vector in subsequent text.
More importantly, thinking of point spread functions as
vectors allows us to transform the image vector into a
different finite basis space corresponding to the prototypical
features that we are interested in. This transformation
is wrought by a non-singular matrix T whose columns are
the feature basis vectors $\{\mathbf{f}_k\}$. Thus the image vector r is
transformed into the vector v where

$$\mathbf{r} = T\mathbf{v} = \sum_{k} v_k \mathbf{f}_k \tag{1}$$

The purpose of this transformation is to obtain an image
code whose components correspond to the degree of
match between the image function and the feature
functions. In general, to compute the transformed vector v
from r requires the solution of simultaneous linear
equations. However, computation of the kth component of v
becomes simple when $\mathbf{f}_k$ is orthogonal to the other basis
vectors in the set $\{\mathbf{f}_k\}$. In this case, we have from
equation (1)

$$\mathbf{r}\cdot\mathbf{f}_k = v_k \|\mathbf{f}_k\|^2$$

In particular, if the basis vectors are chosen so that they
form an orthonormal set then

$$\mathbf{r}\cdot\mathbf{f}_k = v_k$$

Since the image vector is finite dimensional we can
design an orthogonal basis set for the space of the image
vector. In addition, this basis set is constructed in such a
way that each basis vector corresponds to a feature primitive.
Thus decomposing the image vector in terms of the new
basis would give us a new set of components (or weights)
indicating the strength of each of the features represented
by the respective basis vector.

2.2 Selection of the Feature Basis

The set of basis functions in our model is built
around feature primitives like edge, maxima/minima and
saddle type variation. Since the image vector is nine
dimensional (i.e., the operator size is 3x3) there are nine
elements in the feature basis space. There are many
versions of edge masks, of which the Sobel masks have been
chosen. The maxima/minima feature is represented by
the laplacian centre/surround operator. There are three
saddle type primitives. Finally, the basis is completed by
two edgelike masks and an averaging operator to capture
the dc intensity. The image space is thus divided into
three subspaces:

1. The Extremum subspace defined by the laplacian
and saddle masks.

2. The edge subspace.

3. The average (or dc) subspace.

The Edge Basis Functions.

The Extremum Basis Functions.

The Averaging Function.

3.0  The  Matching  Algorithm 

The  correspondence  problem  is  almost  universally 
regarded  as  difficult.  As  mentioned  earlier,  it  arises  in  the 
measurement  of  image  disparities.  The  problem  is 
magnified  in  the  case  of  motion  since  the  disparity  in  this 
case  is  not  constrained,  as  in  the  case  of  stereo  disparity, 
to  be  parallel  to  a  base  line.  The  overall  scheme  is  sim¬ 
ple:  select  interest  points  in  image  frames  and  then  decide 
which  point  from  one  frame  matches  another  point  from 
the  other  frame.  If  it  is  possible  to  obtain  interest  points 
that  are  sparse  then  correspondence  is  not  difficult.  Here 
sparseness  means  that  the  average  disparity  value  is 
smaller  than  the  average  spatial  distance  between  points 
in  the  same  image  frame.  One  way  to  obtain  this,  at  the 
cost  of  losing  accuracy,  is  to  operate  at  multiple  resolu¬ 
tion.  This  means  that  the  algorithm  is  applied  to  images 
that  are  bandpass  filtered  and  sampled  by  different  reso¬ 
lution  grids.  A  better,  albeit  costly,  way  is  to  use  correla¬ 
tion  of  image  patches  as  a  measure  of  the  closeness  of 
match. This again is prone to error due to noise or due to
variation in the average intensity in corresponding regions
in the image frames. Another source of confusion is a
contour, giving rise to what Marr & Ullman call the
aperture problem [Marr & Ullman 1981].

The outline of our algorithm is as follows:

for both image frames do
    for every image location do
    begin
        decompose the image into components with respect to
        the normalised feature primitive basis.
        decide whether to select or reject the point.
    end

for every interest point in the first frame do
    find the point in the other frame that best matches it.

The important parts of this algorithm are

1. The decision rule in the first loop.

2. The matching method in the second loop.

These are related, since a stringent decision rule may
miss many good interest points in its effort to maintain
sparseness. On the other hand, if the matching algorithm
is relatively sophisticated, matching labelled points, then
the decision rule need not be sophisticated.

Some examples of decision rules are:

(1) Projection of the image vector onto the extremum
space is greater than the projection onto the edge
space:

$$\sum_{\text{extremum}} \mathbf{r}\cdot\mathbf{f}_i > \sum_{\text{edge}} \mathbf{r}\cdot\mathbf{f}_j$$

(2) Of all the image components in the feature space,
the one corresponding to the laplacian basis is the
maximum.

(3) Either rule 1 or rule 2.

(4) Rule 3 with the proviso that if there is a clear
maximum in the projections then it is not onto one of
the edge space bases.
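
The four rules can be sketched as predicates over the nine feature coefficients of a patch. Several choices below are assumptions rather than anything stated in the text: the index assignment of coefficients to the edge, extremum and dc subspaces is hypothetical, projection length is measured by the sum of squared coefficients, the dc coefficient is left out when looking for the maximum, and a "clear maximum" is taken to mean that the largest magnitude exceeds the runner-up by a factor `dominance`.

```python
EDGE = (0, 1, 2, 3)        # hypothetical indices of the edge subspace
EXTREMUM = (4, 5, 6, 7)    # hypothetical indices of the extremum subspace
LAPLACIAN = 6              # hypothetical index of the laplacian coefficient
DC = 8                     # hypothetical index of the averaging coefficient


def energy(coeffs, indices):
    """Squared length of the projection onto the given subspace."""
    return sum(coeffs[i] ** 2 for i in indices)


def ranked_features(coeffs):
    """Non-dc feature indices ordered by decreasing coefficient magnitude."""
    return sorted((i for i in range(len(coeffs)) if i != DC),
                  key=lambda i: abs(coeffs[i]), reverse=True)


def rule1(coeffs):
    return energy(coeffs, EXTREMUM) > energy(coeffs, EDGE)


def rule2(coeffs):
    return ranked_features(coeffs)[0] == LAPLACIAN


def rule3(coeffs):
    return rule1(coeffs) or rule2(coeffs)


def rule4(coeffs, dominance=1.5):
    if not rule3(coeffs):
        return False
    first, second = ranked_features(coeffs)[:2]
    clear_max = abs(coeffs[first]) >= dominance * abs(coeffs[second])
    # Reject the point when a clearly dominant response is an edge response.
    return not (clear_max and first in EDGE)


if __name__ == "__main__":
    blob = [0.1, 0.2, 0.0, 0.0, 0.5, 0.0, 3.0, 0.0, 40.0]
    edge = [5.0, 0.2, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0, 40.0]
    print(rule4(blob), rule4(edge))
```

A stricter `dominance` value makes rule 4 behave more like rule 3, since fewer responses then count as a clear maximum.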

The matching rule is then formulated according to
whether the points are labelled or not. In the case of
unlabelled points:

All neighbouring points support (vote for) a particular
disparity value. Similar values support each other
in a local region. Shorter length disparities are
preferred. A point adopts a match for which it finds the
maximum support.
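
A minimal sketch of this voting scheme for unlabelled points follows. It is one possible reading of the rule, not the authors' implementation: the parameter names (`max_disp`, `radius`, `tol`, `length_penalty`) and their default values are hypothetical, a neighbour counts as a supporter if it has some candidate disparity within `tol` of the one being scored, and the preference for short disparities is expressed as a small penalty on length.

```python
import math


def candidate_disparities(p, frame2, max_disp):
    """Disparity vectors from p to every frame-2 point within max_disp."""
    return [(q[0] - p[0], q[1] - p[1]) for q in frame2
            if math.dist(p, q) <= max_disp]


def match_points(frame1, frame2, max_disp=5.0, radius=20.0,
                 tol=1.0, length_penalty=0.01):
    """Single-pass local-consensus matching of unlabelled interest points."""
    cands = {p: candidate_disparities(p, frame2, max_disp) for p in frame1}
    matches = {}
    for p, own in cands.items():
        best, best_score = None, -math.inf
        for d in own:
            # Count neighbours that could move by a similar disparity.
            support = 0
            for q, theirs in cands.items():
                if q == p or math.dist(p, q) > radius:
                    continue
                if any(math.dist(d, e) <= tol for e in theirs):
                    support += 1
            # Prefer shorter disparities when support is equal.
            score = support - length_penalty * math.hypot(d[0], d[1])
            if score > best_score:
                best, best_score = d, score
        if best is not None:
            matches[p] = (p[0] + best[0], p[1] + best[1])
    return matches


if __name__ == "__main__":
    frame1 = [(10, 10), (20, 12), (15, 25), (30, 30)]
    frame2 = [(12, 11), (22, 13), (17, 26), (32, 31), (11, 8)]
    print(match_points(frame1, frame2))
```

The scores are computed in a single pass over the candidates, with no iterative refinement of the votes.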

The  strategy  is  similar  in  spirit  to  the  more  sophisti¬ 
cated  matchers,  for  instance,  those  using  labelled  points 
(e.g.  Prager  &  Arbib  1983).  In  our  case  the  points  carry  a 
label  which  is  computed  from  the  outputs  of  the  nine 
basis  operators.  The  label  is  a  code  that  identifies  the 
image  point  in  question.  Now  the  matcher  weights  the 
"supporting"  votes  according  to  the  similarity  of  these 
codes.  However,  we  avoid  iterative  refinement,  which  is 
common  in  similar  algorithms  (Barnard  &  Thompson 
1980). 

4.0  Conclusion 

The  experiments  that  we  have  conducted  so  far 
encourage  us  to  pursue  the  orthogonal  decomposition 
method  for  selection  of  interest  points.  The  algorithmic 
design  of  the  correspondence  scheme  is  biologically 
motivated.  The  advantages  of  the  method  are  that  it  is 
not  dependent  on  illumination  constancy,  it  eliminates 
contours  and  that  it  is  implementable  in  parallel 
hardware.

PLATE I. The interest operator applied to the photograph
of a natural scene.

PLATE II. Interest points in an office scene.

PLATE III. Correspondence established in the office
scene.

Abstract

This paper describes a theory for interpreting line drawings.
Line drawings are considered to be Image Structure graphs
derived from the image by segmentation and aggregation
operations. The goal of interpretation is to hypothesise a
corresponding spatial structure. Inference rules are derived
from geometry and six additional assumptions. These rules
are applicable for scenes with curved as well as planar
surfaces. A control structure is sketched which works on a least
commitment style of reasoning. The approach is illustrated
with a worked-out example.

1.  Introduction. 

This paper describes a theory of line drawing interpretation. Section 2
reviews past work in this and related areas and attempts to
characterise the limitations of those approaches. Section 3 describes
what kind of input we expect from the lower level processes. Section 4
describes the output representation. Section 5 develops our world-view:
what kind of objects are permitted in the scene. Given this, we
exhaustively characterise all that can possibly appear in the line
drawing. Section 6 states six assumptions which seem to be the key to
the process of interpretation of line drawings. Section 7 has some
sample inference rules. Section 8 discusses control structure aspects.
Section 9 has a solved example to illustrate the approach.

Computer implementation of this theory is under way with a program
called DRISHTI.

2.  Review  of  Past  Work. 

In Computer Vision, interpretation of line drawings has been taken to
mean many different things.

1. Segmentation into individual bodies (Guzman [7]).

2. Identifying support edges (Falk [6], Winston [10]).

3. Line labelling using a junction catalog (Huffman [8], Clowes [4],
Waltz [13]) or Gradient Space Reasoning (Mackworth [11]).

4. Finding whether the line drawing depicts an "impossible" object
(Mackworth [11]).

5. Finding the "optimum" three dimensional shape corresponding to a
single isolated image contour (Barrow and Tenenbaum [1], Brady and
Yuille [3]).

6. Inferring 3-D structure from non-accidental regularities in the
image, e.g. skewed symmetries and parallels (Kanade [9], Binford [2],
Lowe and Binford [10]).


Segmentation into individual bodies and identifying support edges is an
essential component of interpretation but needs to be done in a general
way, as opposed to by ad hoc heuristics which fail irretrievably in
many situations.

Line labelling using a predetermined junction catalog cannot deal with
scenes which violate the assumptions on which the catalog is based.
Also, interpretation involves more than line labelling: it includes
estimates of metric information. Gradient Space, while it permits
quantitative inference, suffers from another fatal flaw: as was shown
quite strikingly by Draper in his thesis [5], there is a combinatorial
explosion in the number of 'bizarre' spatial interpretations possible
for a given line drawing. By bizarre interpretations we mean physically
realizable configurations which could give rise to the line drawing,
but are not usually perceived by a human looking at the scene. Instead,
the goal of human vision, and of good machine vision, is to find the
one or two (in the case of certain well-known illusions like the Necker
Cube) most plausible interpretations.

Detection of "impossible" objects (4) is not a primary goal of visual
processing. The human system and any useful computer vision system deal
with images of real objects. Line drawings can be impossible to realize
as objects under "reasonable" assumptions. Failure of the
interpretation process to resolve inconsistency results in the
detection of impossible objects.

Present techniques used to find optimum 3-D shapes corresponding to
single isolated contours ignore cues in the line drawing from other
areas, interior curves, and junctions. Psychological experiments have
shown the prime importance of these other cues.

The other objectives are all subsumed in our framework in a clean way,
as opposed to the ad hoc techniques used in previous work.

3.  Image  Structure  Graph 

The Image Structure graph is obtained from the image by segmentation
and aggregation operations. Edge finding and linking, region growing,
corner finding, and perceptual organization are some such processes. We
prefer the term Image Structure Graph to the term Line drawing as it is
more suggestive and indicates a relevance to general vision. There may
be multiple image structures for a portion of an image corresponding to
hierarchies of description.

The nodes of the Image Structure Graph are discriminable points, curves
and areas; the arcs are of two kinds: t-arcs (for topological
relationships) and g-arcs (for geometric relationships). The t-arcs are
of four kinds: bounds and is-bounded-by, which have an inverse
relationship, and interior and exterior, which have an inverse
relationship. The g-arcs are used for representing relationships like
parallelism, collinearity, symmetry etc.

These  elements  are  found  in  the  following  way. 

1. Discriminable point. In the image structure graph, it corresponds to
either a tangent or curvature discontinuity in a piecewise smooth
curve or to the intersection of two or more image curves or the
termination of a curve.

2. Curve. A curve in the image is the output of edge finding and
perceptual organisation processes.

3. Area. An area in the Image Structure graph is a connected subset in
an image. It is defined by its boundary.
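
As an illustration of the representation just described, the following
is a minimal sketch of an Image Structure Graph as a data structure. The
class and method names (ImageStructureGraph, add_t_arc, add_g_arc,
boundary_of) are hypothetical and are not taken from the DRISHTI
program; only the three node kinds and the four t-arc relations come
from the text.

```python
from dataclasses import dataclass, field

# Inverse pairs of topological (t-arc) relations named in the text.
T_INVERSE = {
    "bounds": "is-bounded-by",
    "is-bounded-by": "bounds",
    "interior": "exterior",
    "exterior": "interior",
}

@dataclass
class ImageStructureGraph:
    nodes: dict = field(default_factory=dict)  # name -> "point" | "curve" | "area"
    t_arcs: set = field(default_factory=set)   # (source, relation, target)
    g_arcs: set = field(default_factory=set)   # (relation, frozenset of nodes)

    def add_node(self, name, kind):
        if kind not in ("point", "curve", "area"):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = kind

    def add_t_arc(self, src, relation, dst):
        """Add a topological arc together with its inverse."""
        self.t_arcs.add((src, relation, dst))
        self.t_arcs.add((dst, T_INVERSE[relation], src))

    def add_g_arc(self, relation, *names):
        """Add a symmetric geometric relation such as 'parallel'."""
        self.g_arcs.add((relation, frozenset(names)))

    def boundary_of(self, area):
        """Curves recorded as bounding the given area."""
        return {s for (s, r, d) in self.t_arcs if r == "bounds" and d == area}
```

For example, after g.add_t_arc("C1", "bounds", "A1") the inverse arc
("A1", "is-bounded-by", "C1") is stored as well, so either direction
can be followed during interpretation.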

4.  Spatial  Interpretation  Graph 

The nodes of the Spatial Interpretation Graph correspond to elements of
the scene in space: surfaces, curves and points. These are not the same
as image structures. We use the notation $\{P_i, C_i, A_i\}$ for image
points, curves and areas respectively. For scene elements we use
$\{p_i, c_i, S_i\}$ for scene points, curves and surfaces. There may be
scene elements, e.g. hidden surfaces, which have no corresponding
elements in the Image Structure Graph.

A surface element is a surface or a surface element restricted by a
boundary. A surface element is represented as a surface and a boundary
or as point set operations on surface elements (union, intersection,
difference). The boundary is a space curve which may be null. The
surface is also typically a boundary of two volumes.

A similar representation holds for curves, surfaces, volumes, and
hypervolumes. In each case, the boundary has lower dimension. For
example, a curve element is a curve restricted by a boundary, i.e. a
set of points, or by unions, intersections, and differences of curve
elements.

In addition to arcs from a node pointing to its boundaries, there are
also arcs pointing to the node/s of which it is a boundary. Also
attached to each node is a set of geometrical properties like position,
direction (for curves), orientation (for surface elements), and
curvature (for curves and surface elements). These may be specified
either in an environment-centered frame or a viewer-centered frame.
Heights of objects may be most conveniently expressed as above the
ground plane, whereas a limb boundary of a surface element is specified
relative to the viewpoint (the line of sight grazes the surface
tangentially along the limb). These estimates may be very sketchy and
can be refined by other shape-from processes.

5. Modelling the scene and the projection process

The scene consists of objects which are connected volumes bounded by
smooth surfaces. A smooth surface is defined as one which has a unique
tangent plane everywhere. The points on the object where there is no
unique tangent plane are either isolated points, e.g. the apex of a
cone, or curves where two smooth surfaces intersect, e.g. the edges of
a cube. These curves are true edges: there is a discontinuity of
surface orientation across them.

We use a very simple model of illumination: diffuse lighting and
nonspecular surfaces. Interpretation under more complex lighting
conditions giving rise to specularities, shadows etc. requires more
information, e.g. brightness values, than is available in the Image
Structure Graph. Incorporating these will be the subject of future
work.

The various elements of the image structure graph originate in one of
the following three ways:

1. Projection of a single smooth surface. A single smooth surface
projects to an area. Parabolic lines on the surface, which correspond
to the transition between elliptic and hyperbolic patches, may project
to significant interior curves in this area. The only other elements of
the image structure graph which can result from the projection of a
smooth surface correspond to the singularities of the visual mapping.
These can only be folds and cusps [14]. Cusps are isolated points.
Folds have traditionally been called limbs or apparent edges in vision
literature.

2. Intersection of two or more smooth surfaces. Two intersecting
surfaces give rise to a true edge. This either represents a
discontinuity of surface orientation across it or a crack between two
different objects. Three or more surfaces intersecting at a point in
space give rise to a junction. Edges which are space curves give rise
to discriminable points corresponding to tangent or curvature
discontinuities on them.

3. Occlusion. To catalog exhaustively all the phenomena we first
observe that an n-dimensional manifold can occlude only manifolds of n
dimensions or less. A point can occlude another point only under a
special viewpoint: occluding point, occluded point and observer in the
same straight line. Similarly a curve can occlude another curve or a
point only under a special viewpoint. If the occluding manifold is a
surface, then we have three cases. A surface occluding a point results
in its absence in the image structure graph. The occlusion of a curve
by a surface in the scene gives rise to (a) a T-junction and one
visible segment or (b) two back to back T-junctions and two visible
segments or (c) complete absence of any corresponding element in the
image structure graph. The occlusion of a surface $S_1$ by a surface
$S_2$ gives rise to (a) the elements associated with the occlusion of
the curves of $S_1$ by $S_2$ and (b) areas corresponding to the
unoccluded parts of $S_1$. These are defined by the curves from (a).

Because of the finite resolution of measuring devices, solid objects
may degenerate into laminae or wires in the line drawing.

6. Interpretation: Basic Approach

In order to interpret line drawings it is essential to be able to carry
out simple geometric reasoning. To provide this ability a set of
axioms/theorems from Euclidean Geometry and a logical inference
mechanism are needed. As projection is a many-to-one mapping we need
some additional assumptions to facilitate the recovery of three
dimensional structure. Any of these assumptions could be violated in
particular situations; they are merely very good heuristics. The
fallible nature of these assumptions imposes a constraint on the
logical inference mechanism: it must be nonmonotonic and allow the
retraction of deductions invalid in the presence of further evidence.
These assumptions fall into two classes:

Assumptions about the Environment.

1. Opacity. Surfaces are opaque. This permits us to assign a unique
surface to an area in the image.

2. Solidity. Any space curve either bounds two surfaces or is an
interior curve on a surface. This encourages solid and lamina
interpretations (a boundary curve of a lamina belongs to the two sides
of a lamina) and discourages wires in the absence of additional
evidence.

$\exists S_k, S_l\ [(c_i \subset S_k) \wedge (c_i \subset S_l)]$. If
$(C_i \subset A_m) \wedge (C_i \subset A_n)$, $S_k, S_l$ do not
necessarily correspond to $A_m, A_n$.

3. Support. The scene consists of objects which are in physical
equilibrium, i.e. (a) the resultant of the body's weight and the
reactions at the points of support is zero; (b) the net moments about
points of support are zero. From these physical conditions a set of
more usable rules can be derived. It is assumed that the direction of
the gravity vector is known.

4. Frequent Structures/Relationships. In both natural and man made
scenes, certain spatial relationships like rectangularity, symmetry and
verticals occur with a much higher frequency than purely by chance. The
physical reasons for this are not too hard to see. This enables us to
do quasi Bayesian reasoning and infer the presence of these
relationships if there are supporting scenes. This is best explained
with the help of an example. An orthogonal trihedral vertex (OTV) in
the scene projects either to an arrow or a Y-junction which satisfies
the Perkins criteria, i.e.
$\cos\theta_1 \cos\theta_2 \cos\theta_3 < 0$, where
$\theta_1, \theta_2, \theta_3$ are the angles between the rays in the
image. If the Perkins criteria are satisfied by the junction, then
either it could be a genuine OTV or a configuration of three straight
lines in space which just happens to project to an image which
satisfies the criteria. The assumption of rectangularity is commonly
made by humans [12]. The skewed symmetry heuristic of Kanade is another
example. Proper applications of this assumption require that the a
priori probability of the structure being in the scene be sufficiently
high.
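
The Perkins test on a three-ray junction is simple to state in code. The
sketch below is illustrative only: it assumes the product form of the
criterion (the product of the three cosines is negative), and the
function names and the tolerance eps are hypothetical.

```python
import math

def junction_angles(directions):
    """Angles between consecutive rays of a 3-ray junction.

    directions: three ray directions in radians. The rays are sorted by
    direction, so the three returned angles sum to 2*pi.
    """
    d = sorted(a % (2 * math.pi) for a in directions)
    return [(d[(i + 1) % 3] - d[i]) % (2 * math.pi) for i in range(3)]

def satisfies_perkins(angles, eps=1e-9):
    """True if cos(t1) * cos(t2) * cos(t3) < 0 for the three image angles."""
    product = math.prod(math.cos(t) for t in angles)
    return product < -eps
```

A symmetric Y-junction (three angles of 120 degrees) passes, as does an
arrow with angles of 60, 60 and 240 degrees. An arrow with angles of
30, 30 and 300 degrees fails, and so does a T-junction, where two of
the cosines vanish.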

Assumptions about the Viewpoint.

1. Non-accidental Viewpoint. Interpretations which require the
viewpoint to be from a set with a two-dimensional measure of zero are
excluded. This assumption, originally due to Huffman, is best
illustrated
[figure not reproduced]
image curves $C_1, C_2, C_3$ correspond to space curves
$c_1, c_2, c_3$ which intersect in space at a vertex $p_1$ (inverse of
$P_1$). For this to be an accident the endpoints of the three curves
would have to be along the same ray from the observer. This corresponds
to a two-dimensional measure of zero. Alternatively we can treat this
statistically. The two visible components of the difference vector are
$\approx \epsilon$ each, where $\epsilon$ is the size of a pixel. Given
this, the maximum likelihood estimate of the third component is of the
order of $\epsilon$.

2. Terrestrial Viewpoint. The scene consists of objects lying on or
above a ground plane being viewed by an observer above the ground
plane. This assumption, obviously useful from an ecological perspective
for humans and other land-based animals, is an important component of
our perception of line drawings. It enables us to choose among Necker
flip related interpretations.

7.  Some  Inference  Rules 

We view the process of interpretation as one of constructing a simple
explanation for the Image Structure graph (ISG). This immediately
raises the questions: What is simple? What is an explanation? In our
view of things an explanation is a Spatial Interpretation Graph (SIG)
which under the projection process described in Section 5 would result
in the given ISG. The notion of simplicity is twofold: (a) Occam's
razor, "Entities will not be multiplied without necessity"; (b)
minimise the number of violations of the perceptual principles stated
in Section 6. We describe here some inference rules derived from these
considerations. The Rules for interpreting the ISG can be partitioned
into three groups:

1. Assigning a spatial interpretation for each node of the ISG and the
t-arcs.

2. Introducing additional constraints in the SIG corresponding to the
g-arcs in the ISG.

3. Making necessary geometric inferences and enforcing consistency.

1. Instantiation Rules. A junction, i.e. a discriminable point in the
image, corresponds to a unique point in space. A curve in the image
corresponds to a unique curve in space. An area in the image
corresponds to a unique surface in space. This enables us often to use
the same indices for corresponding structures: $P_i \to p_i$,
$C_j \to c_j$, $A_k \to S_k$. This notational convention will be used
in the rest of the paper, unless otherwise stated.

The  next  four  rules  enable  us  to  deduce  which  curves  belong 
to  which  surfaces. 


2. Edge Rule. If $(C_i \subset A_k) \wedge (C_i \subset A_l)$, i.e. the
image curve bounds $A_k$, $A_l$, then $(c_i \succeq S_k)$ and
$(c_i \succeq S_l)$, i.e. the edge is not behind either surface. We
will use $\succeq$ to denote this viewpoint-dependent relationship.

3. Arrow Rule. An arrow junction is defined to be one where two of the
angles are less than $\pi/2$ and the third
[figure not reproduced]
... and $c_3 \in S_2$. This rule is interesting because to derive it,
it is necessary to invoke the notion of the minimum number of surfaces
which can explain the junction.

4. T-Junction. For the T-junction shown [figure not reproduced],
$c_2 \succeq c_1$. There are two cases:

1. Occlusion T-junction. $c_2 \succeq c_1 \wedge c_2 \in S_1$.

2. Object alignment. $S_1$ belongs to a different object than $S_2$,
$S_3$.

5. Limb-T Junction. A limb-T junction is defined by the fact that all
the three curves $C_1, C_2, C_3$ share a common tangent at $P_1$. Here
it can be shown that $c_3 \in S_1$, $c_2 \in S_2$, $c_1 \in S_1$.
6. Continuity Rules. Here we define zeroth order continuity to be
coincidence in space, first order continuity to be continuity in slope
and second order continuity to be continuity of curvature. Here are the
rules as they apply to curves.

6.1 Zeroth order. Curves which intersect in the image intersect in
space.

6.2 First order. A curve which is smooth in the image is smooth in
space.

6.3 Second order. Segments of the curve which have continuous curvature
in the image have continuous curvature in the scene.

There are corresponding rules for surfaces. Here discontinuities can be
along points or curves. An area which has no visible creases has
smoothly varying curvature. This provides us with a way to estimate
surfaces.

7. Surface Estimation. A surface is estimated by first estimating
relative position, orientation, and curvature at sparse locations on
the surface at which curves are visible. Surfaces and curves are
regarded as splines through sampled points. For example, if $C_1$ is
straight and $C_1 \to c_1$ then assume $c_1$ is straight in space and
that the surface $c_1$ lies on is singly-curved. If $C_1$ and $C_2$ are
straight and intersect, then in absence of further evidence assume that
the surface on which they lie is planar. This can be generalized
further, as the next rule shows.

8. Coplanarity Rules. A set of straight lines $\{c_1, \ldots, c_n\}$ is
said to be coplanar iff $\exists S.\ planar(S) \wedge c_i \in S$. The
following three rules enable us to deduce coplanarity:

$intersect(c_1, c_2) \Rightarrow coplanar(\{c_1, c_2\})$

$parallel(c_1, c_2) \Rightarrow coplanar(\{c_1, c_2\})$

$(intersect(c_i, c_k) \vee parallel(c_i, c_k)) \wedge intersect(c_i, c_l) \wedge c_k \in S \wedge c_l \in S \wedge coplanar(S) \Rightarrow coplanar(S \cup \{c_i\})$

9. Plane Attachment. If $c_j$, $c_k$, both elements of a coplanar set
$\{c_1, c_2, \ldots, c_n\}$, satisfy $c_j \in S$, $c_k \in S$ for some
surface $S$, then $c_1, c_2, \ldots, c_n$ are all $\in S$.

10. OTV reasoning. Suppose that some 3-star has angles between its rays
$a$, $b$ and $c$ and also that the rays are
[...]
Since projection is accomplished by dropping the z-component, the
vectors in space must be of the form $v_1 + \lambda_1 z$,
$v_2 + \lambda_2 z$, and $v_3 + \lambda_3 z$ where $z$ is the unit
vector in the z-direction.

Requiring mutual orthogonality implies that the dot products of these
vectors in pairs be zero. From these conditions and some simple
manipulations we can calculate the formulas for
[...]
Hence solutions exist if (a) $\cos a$, $\cos b$, $\cos c$ are all
nonzero and (b) either one or three of $\cos a$, $\cos b$, $\cos c$
are negative, so that the quantities under the square root sign are
positive.

These results were first derived by Perkins [12].
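
The orthogonality conditions can be worked through numerically. The
sketch below is an illustration under assumptions that are not in the
surviving text: the image rays are taken to be unit vectors
$v_1, v_2, v_3$, with $a$ the angle between $v_1$ and $v_2$, $b$ the
angle between $v_2$ and $v_3$, and $c$ the angle between $v_3$ and
$v_1$. Pairwise orthogonality of $v_i + \lambda_i z$ then gives
$\lambda_1\lambda_2 = -\cos a$, $\lambda_2\lambda_3 = -\cos b$ and
$\lambda_3\lambda_1 = -\cos c$, so that
$\lambda_1^2 = -\cos a \cos c / \cos b$. The function name is
hypothetical.

```python
import math

def otv_depth_components(a, b, c):
    """Depth components (l1, l2, l3) of three mutually orthogonal space
    vectors v_i + l_i * z whose unit image projections make the angles
    a = angle(v1, v2), b = angle(v2, v3), c = angle(v3, v1).

    Returns None when no solution exists. The mirror solution
    (-l1, -l2, -l3) is equally valid.
    """
    ca, cb, cc = math.cos(a), math.cos(b), math.cos(c)
    # A solution needs all cosines nonzero and a negative product,
    # i.e. exactly one or three negative cosines.
    if min(abs(ca), abs(cb), abs(cc)) < 1e-12 or ca * cb * cc >= 0:
        return None
    # From l1*l2 = -ca, l2*l3 = -cb, l3*l1 = -cc:
    l1 = math.sqrt(-ca * cc / cb)
    l2 = -ca / l1
    l3 = -cc / l1
    return l1, l2, l3
```

The existence test in the code is the same product condition as the
Perkins criteria, and the two sign choices for the square root
correspond to the two depth-reversed interpretations of the junction.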

8. Control Structure Specifications

The control problem is to be handled by a meta level architecture. A
deliberation-action loop is at the heart of such an approach. There are
three main reasons why we need to use meta-level rules to direct the
inference process:

Reduce  Search 

In order to minimise search, the deductions proceed from the most
reliable to the least reliable. This least commitment style of
reasoning is a major characteristic of human perception as pointed out
among others by Perkins [12]. We have found in our numerous hand
simulations that using the following meta heuristics, the search is
negligible.

1. Instantiate Spatial Graph from Image Graph first.

2. Consider neighboring vertices together and prune off
inconsistencies.

3. Use planarity early.

4. Use T junctions early.

5. Use geometrical rules at once if they do not require any case
analysis.

6. Act immediately whenever an exceptional condition is flagged.
Nonmonotonic reasoning

Most of our reasoning rules have exceptions: wires, surface marks,
accidental coincidences of viewpoint etc. These are detected either by
detecting contradictions or by direct inference. When contradictions
are detected, a Truth Maintenance process gets into action: the
offending fact/s are detected and conclusions drawn from those
retracted. The Truth Maintenance process can be represented by a set of
meta rules which operate on the inference graph. The second way
exceptional conditions are detected is by object level rules, e.g.
X-junctions denote either wires or transparency. Here again meta-level
rules enable the retraction of previous conclusions.
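
The retraction step can be sketched as a dependency walk over the
inference graph. This is a simplified illustration, not the mechanism
of the program described here: it assumes each conclusion has a single
recorded justification, and the class and method names are
hypothetical.

```python
class InferenceGraph:
    """Records which facts each conclusion was derived from."""

    def __init__(self):
        self.facts = set()
        self.support = {}  # conclusion -> set of premises used to derive it

    def assume(self, fact):
        self.facts.add(fact)

    def derive(self, conclusion, premises):
        if not all(p in self.facts for p in premises):
            raise ValueError("all premises must currently be believed")
        self.facts.add(conclusion)
        self.support[conclusion] = set(premises)

    def retract(self, fact):
        """Withdraw a fact and, transitively, everything derived from it."""
        pending = [fact]
        while pending:
            f = pending.pop()
            if f not in self.facts:
                continue
            self.facts.discard(f)
            self.support.pop(f, None)
            # Conclusions that used f as a premise lose their justification.
            pending.extend(c for c, prem in self.support.items() if f in prem)
```

When a contradiction implicates an assumption, calling retract on it
removes every conclusion that depended on it and leaves independent
conclusions in place.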

Dealing  with  noisy  data 

The input to our program, the Image Structure Graph, is obtained as the
output of an edge-finding process which does not yield perfect line
drawings. Lines may have missing segments, there may be more than one
line corresponding to a single physical edge, and so on. These get
reflected in the conclusions drawn: what seems to be a surface mark may
actually be a true edge with a missing segment. Again this can be
handled by meta-rules like: if you detect a surface mark, look for a
continuation mark. A model of the edge-finding process is required for
this stage in order to be able to spot likely errors.

9. An Example: The cube with a hole.


Note  that  the  labelling  shown  in  the  figure  is  that  of  the 
corresponding  nodes  in  the  Spatial  Interpretation  Graph. 


1. Using the Instantiation Rule, a SIG is hypothesised with point nodes
$p_1, \ldots, p_{12}$, edge nodes $c_1, \ldots, c_{13}$, surface nodes
$S_1, \ldots, S_6$.

2. Using the Coplanarity rules, deduce the coplanar sets
$\{c_6, c_7, c_8, c_9\}$, $\{c_9, c_{10}, c_{11}, c_{12}\}$,
$\{c_8, c_{10}, c_{12}, c_{13}\}$, and $\{c_1, c_2, c_3, c_4\}$.

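3. Using the arrow rule it may be deduced that

3.1 $c_6 \in S_1$, $c_9 \in S_1$, $c_9 \in S_2$, $c_{12} \in S_2$ from
$P_9$.

3.2 $c_{11} \in S_2$, $c_{10} \in S_2$, $c_{10} \in S_3$,
$c_{12} \in S_3$ from $P_{10}$.

3.3 $c_7 \in S_1$, $c_8 \in S_1$, $c_8 \in S_3$, $c_{13} \in S_3$ from
$P_8$.

3.4 $c_1 \in S_4$, $c_5 \in S_4$, $c_5 \in S_5$, $c_3 \in S_5$ from
$P_2$.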

4. Using the results of steps 2, 3 and the Plane Attachment Inference
Rule, it can be shown that

$c_6, c_7, c_8, c_9$ are all $\in S_1$
$c_9, c_{10}, c_{11}, c_{12}$ are all $\in S_2$
$c_8, c_{10}, c_{12}, c_{13}$ are all $\in S_3$

5.  The  T-Junction  rule  applied  to  p*  gives  cither  c4  >- 
eg  A  C4  £  Si  or  there  is  object  alignment. 

6.  Consider  c3.  It  is  parallel  to  ci  which  intersects  Sg  at 
only  one  point.  As  Sg  is  planar,  Ci  and  c3  arc  straight, 
it  follows  thatlfi  G  Sg)r>  C3GS,  SigtBula. 

7. As $c_1, c_3 \in S_1$ we can use the Plane Attachment rule to infer
that

$c_1, c_2, c_3, c_4$ are all $\in S_1$

8. The Perkins criteria are satisfied at the junctions. We can apply
the OTV Rule to deduce directions at the junctions and from that
compute the orientations of the surfaces containing them. Using Support
on $S_6$ and the Terrestrial Viewpoint assumption, we arrive at the
complete interpretation.
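
The coplanarity and plane attachment deductions used in the steps above
are mechanical closure computations and can be sketched as follows. The
code is an illustration under assumptions: lines are given by name,
intersections are recorded together with the point at which they occur,
and the third Coplanarity rule is applied only when the two contacts
with the set are at distinct points (or one of them is a parallelism),
a guard the rule as printed leaves implicit. All names are
hypothetical.

```python
from itertools import combinations

def coplanar_sets(lines, meets, parallel):
    """Maximal coplanar sets derivable from the three Coplanarity rules.

    lines:    iterable of line names
    meets:    {frozenset({a, b}): point} for pairs of lines that intersect
    parallel: set of frozenset({a, b}) for parallel pairs
    """
    lines = sorted(lines)

    def attaches(c, group):
        for k in group:
            for l in group:
                if k == l:
                    continue
                at_l = meets.get(frozenset((c, l)))
                if at_l is None:
                    continue  # c must intersect c_l
                at_k = meets.get(frozenset((c, k)))
                if frozenset((c, k)) in parallel or at_k not in (None, at_l):
                    return True
        return False

    # Rules 1 and 2: every intersecting or parallel pair is coplanar.
    groups = [set(pair) for pair in combinations(lines, 2)
              if frozenset(pair) in meets or frozenset(pair) in parallel]
    # Rule 3: grow each set until nothing more can be added.
    changed = True
    while changed:
        changed = False
        for group in groups:
            for c in lines:
                if c not in group and attaches(c, group):
                    group.add(c)
                    changed = True
    unique = {frozenset(g) for g in groups}
    return {g for g in unique if not any(g < other for other in unique)}

def plane_attachment(coplanar, membership):
    """If two lines of a coplanar set lie on a surface, all of the set does.

    membership: {line: set of surfaces}. Returns an updated copy.
    """
    out = {c: set(s) for c, s in membership.items()}
    changed = True
    while changed:
        changed = False
        for group in coplanar:
            counts = {}
            for c in group:
                for s in out.get(c, ()):
                    counts[s] = counts.get(s, 0) + 1
            for s, n in counts.items():
                if n >= 2:
                    for c in group:
                        if s not in out.setdefault(c, set()):
                            out[c].add(s)
                            changed = True
    return out
```

For a rectangle a, b, c, d with a third edge e leaving one corner, the
closure yields the set {a, b, c, d} and keeps e out of it, since e
touches the rectangle at a single point only.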

This example was chosen because it was considered interesting. We have
tried our approach on numerous other examples, including some with
curved surfaces.
ABSTRACT

In this paper we will describe a combined region and edge
representation which is used to guide the transformation of data from a
low level representation to a flexible intermediate level of
representation. This intermediate level of representation in turn
provides information for the control of a multiresolution segmentation
algorithm via multiresolution models. The system employs a
goal-oriented focus of attention mechanism for directing the system to
examine only areas where a coarse level segmentation yields hypotheses
of object classes being sought. The experiments are carried out in the
domain of digital cartography on a 4096 x 4096 pixel aerial photograph.

I.  Introduction 

The vast quantity of data available in the form of aerial and satellite
imagery far exceeds the processing capacity of current computing
environments. Efficient processing of large amounts of data requires
selective focussing of expensive detailed analyses on that fraction of
the image data which is likely to contain objects of interest. In many
cases it might be possible to only roughly interpret or even ignore
large areas of the image. Such a processing methodology implies that
goal-oriented analyses, controlled by strategies which utilize both the
domain knowledge as well as the goal constraints, are required. One
focus of attention method involves reducing the spatial resolution of
the data and using information from coarser levels to direct the
processing at finer levels. We propose a hierarchical multi-resolution
interpretation based on the VISIONS system architecture.

The most suitable methods for applying such selective processing to
imagery of this size are the multi-resolution, or pyramid, techniques
[TAN80]. Typically, from the original, large-scale, full resolution
aerial image is constructed a progression of smaller and smaller
images, each covering the same extent, but at successively coarser
resolution. While many variations on this structure have been proposed,
a pyramid is typically constructed by dividing the image up into
disjoint, two-pixel by two-pixel blocks. From each block a single pixel
of the image at the next coarser resolution is computed, by taking the
mean or median pixel value of the block.
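
The block-reduction step just described can be sketched in a few lines.
This is an illustrative sketch, not the processing cone software: the
image is assumed to be a square list of rows whose side is a power of
two, and the function names are hypothetical.

```python
from statistics import mean, median

def reduce_level(image, combine=mean):
    """Next coarser level: one pixel per disjoint 2x2 block of `image`.

    image: list of equal-length rows with even height and width.
    combine: function applied to the four pixel values of each block.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("image sides must be even")
    return [[combine([image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def build_pyramid(image, combine=mean):
    """Levels from full resolution down to a single pixel."""
    levels = [image]
    while len(levels[-1]) > 1:
        levels.append(reduce_level(levels[-1], combine))
    return levels
```

Passing combine=median instead of the default mean gives the median
variant; the two differ most on blocks containing an outlier, where the
median suppresses it and the mean spreads it into the coarser pixel.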

The underlying computational model is provided by the processing cone
[RIS74, HAN78a, HAN80, GLASS], a pyramid structure implemented as a
parallel array computer hierarchically organized into levels of
increasing (or decreasing) spatial resolution. Level i is
$2^i \times 2^i$ pixels, hence a 4096x4096 image would reside at level
12; i=0 corresponds to an image of one pixel and is the 'top' of the
cone. Processing is accomplished by executing a prototype function in
parallel on windows of data at level i; data may be processed at a
given level, up the cone from finer levels to coarser levels, and
projected from coarser levels to finer levels. The inter-level
communication provides a mechanism by which information at coarser
levels can direct more detailed processing at finer levels in the
hierarchy.

This work was supported in part by the Air Force Office of Scientific
Research under contract number F49620-83-C-0099 and in part by Rome Air
Development Center under task number 1-4-0055.


The experiments described in this paper were conducted on a 4096 x 4096
digitized aerial photograph (see Figure 1.1) with the goal of
efficiently extracting instances of a variety of object classes stored
in the knowledge base. This in fact represents a small image in the
context of digital cartography. The experiments were performed using a
software environment which interfaces the user to virtual processing
cones through LISP [KOH82, KOH83, KOH84]. This also provides a natural
interface to the knowledge base and interpretation system. The
preliminary results described later represent a step in the direction
of developing object and scene models associated with each level in the
processing cone.

II.  Interpretation 

The VISIONS system is an experimental testbed for investigating the
construction of integrated computer vision systems for image
understanding. The goal is to provide an analysis of images from
segmentation through the final stages of symbolic interpretation. The
output of the system is intended to be a symbolic representation of
important world events depicted in the image.

The low-level segmentation system is responsible for decomposing the
original image into easily manipulatable visual primitives such as
regions and lines, and their attributes such as color, texture, size,
shape, orientation, length, etc. While initially not dependent upon
scene knowledge, the segmentation process can be made more context
sensitive as the interpretation process continues via feedback from
semantic processes. An interpretation is created in Short Term Memory
by grouping the visual primitives in various ways and linking them to
semantic labels under the constraints imposed by the knowledge base,
which is referred to as Long Term Memory. This process is accomplished
by applying sequences of knowledge sources, which are modular processes
governing the transformation of data between particular levels of
representation [PAR80]. One important class of knowledge sources that
will be heavily used is that of a rule-oriented object hypothesis.
Knowledge source application takes place under the guidance of a
control strategy and extends a partially constructed interpretation of
the particular image (see [HAN78c]).


II.1. Multiple Resolution Images and Models

Object models should be represented in terms of the image events (i.e.,
image features) which it is possible to extract from an image, as
opposed to image events which it is desirable to extract from an image
[MAR82]. This reasonable design observation must be adapted to the
multi-resolution architectures where features of size, color/intensity,
texture, shape, orientation, etc. for particular objects might only be
extracted at particular levels. For example, given knowledge of the
camera position in an aerial photograph, the size of a region in pixels
of a given object can be predicted at the various levels of the pyramid.
However, the type of averaging (or in general the type of data reduction
transformation) used in constructing the higher levels of the pyramid
can dramatically alter the appearance of an object and either help or
hinder various algorithms for detecting these objects.

In order to reliably extract an object, the knowledge representation
problem also involves specifying a description of the sequence of
processes which must be applied at various levels of the pyramid. These
control processes are of course not isolated since each object establishes
a context which the other processes can take advantage of. The
control strategies can be constructed to make use of multi-resolution
object models which capture the set of visual events - i.e., different
object class features - that are most appropriate at the different levels
of resolution. Multi-resolution object models can be defined in terms
of the line and region attributes that potentially can be extracted, the
reliability and importance of these features, the processes used to
hypothesize and verify objects, and the relations between objects.

II.2. Rule-Based Object Hypotheses

A simple type of knowledge source for generating hypotheses of object
class labels for particular regions has been developed recently in the
VISIONS environment [WEY83]. These knowledge sources take the form of
simple rules defined in terms of ranges over a scalar feature, and
complex rules defined as combinations of the output of a set of simple
rules. The rules can also be viewed as sets of partially redundant
features, each of which defines an area of feature space which represents
a 'vote' for an object. The image features include color, texture, shape,
size, image location, and relative location to other objects, and the
line features include length, orientation, contrast, width, etc.

The basic idea is to form a mapping from a measured value of the
feature obtained from an image region, say f, into a "vote" for the
object on the basis of this single feature. One approach to defining this
mapping is based on the notion of prototype vectors and the distance
from a given measurement to the prototype, a well-known technique
which extends to N-dimensional feature space. In our case, rather than
using this distance to "classify", we translate it into a "vote" by
defining the response of the rule as a function of the distance.
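
As an illustration of a rule response defined as a function of distance,
the following is a minimal sketch, not the rule form of [WEY83]. It
assumes a piecewise-linear response: a full vote within an inner distance
of the prototype, falling linearly to zero at an outer distance. The
function names, the two thresholds, and the weighted-average combination
used for the complex rule are all hypothetical choices made for the
example.

```python
def simple_rule_vote(f, prototype, inner, outer):
    """Map a scalar feature value f to a vote in [0, 1].

    Assumed response: 1 within `inner` of the prototype, 0 beyond
    `outer`, and linear in between.
    """
    if not 0 <= inner < outer:
        raise ValueError("need 0 <= inner < outer")
    distance = abs(f - prototype)
    if distance <= inner:
        return 1.0
    if distance >= outer:
        return 0.0
    return (outer - distance) / (outer - inner)


def complex_rule_vote(votes, weights):
    """Combine simple-rule votes by a weighted average (an assumption)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(votes, weights)) / total
```

A brightness rule with prototype 200 would, under these assumptions, vote
fully for a region of brightness 205 and only partially for one of
brightness 230.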

In many cases, it is possible to define rules which provide evidence
for and against the semantically relevant concepts representing the
domain knowledge. While no single rule is totally reliable, the combined
evidence from many such rules, some of which may be relatively complex
and costly, should imply the correct interpretation. One function
of the focussing mechanisms is to generate constraints on the
application of the more costly rules.

III. Integrating Region and Line Information
in an Interpretation Process

We are using a combined region and edge representation based on
two low-level algorithms: the Nagin-Kohler region segmentation
algorithm [NAG79, NAG82, KOH83, KOH84] and the Burns linear feature
extraction algorithm [BUR84]. Both of these algorithms are briefly
reviewed below. A useful characteristic of the Nagin-Kohler algorithm
that will be exploited in our system is its simple localization process
which can be applied selectively on various subsets of the image data.
The algorithm determines feature cluster peaks in local subimages, and
then forms regions based upon these histogram labels.

The Burns algorithm is a new and robust approach for extracting
linear features in intensity images, including low-contrast linear
features. It provides a low-level representation of intensity variations
by segmenting the intensity surface into connected subsets of pixels
which have similar gradient orientation; these regions act as
'edge-support regions' of a linear feature, and various characteristics
of the associated line or edge can be extracted from them. Thus, both
regions and lines have a common pixel-based representation, which is a
tremendous advantage for information integration.


The simultaneous use of both region and edge information permits
two types of perceptual grouping processes to take place. On the one
hand a region or set of regions can guide the grouping of edges, while
on the other hand a line or set of lines can guide the grouping of
regions. For example, the set of lines on the boundary of a region can be
collected and shape descriptors obtained, or the set of short high
contrast lines within the region can serve as the basis of texture
measures. Similarly all the regions adjacent to a long straight line can
be grouped, yielding a strategy for obtaining semantically significant
collections of regions. The representation of intensity and edge
information as regions facilitates the implementation of these grouping
processes within the hierarchical structure of the processing cone.


III.1. Region Segmentation

The region segmentation technique employed was first developed by
Nagin [NAG79] and extended by Kohler [KOH83, KOH84]. The approach
is in the spirit of the Ohlander-Price algorithms [OHL78, PRI84,
HAN84], but we believe it is more robust in certain situations. The
algorithm involves detecting clusters in a feature histogram, associating
labels with the clusters, mapping the labels onto the image pixels, and
then forming regions of connected pixels with the same label. The
process of global histogram labeling causes many errors to occur because
global information will not accurately reflect local image events which
do not involve large numbers of pixels but which nevertheless are quite
clear. The Nagin algorithm overcomes this difficulty by partitioning the
image into NxN subimages (usually N = 16 or 32) called sectors. The
sectors are each expanded to overlap the adjacent sectors by 25%, and
then the histogram segmentation algorithm is applied independently
to each sector. Thus, each sector receives the full focus of the
cluster detection process and many of the problems of cluster overlap and
'hidden' clusters are significantly reduced. A postprocessing stage is
applied to merge selected regions that have been artificially split along
sector boundaries.

The partitioning of the image into sectors has an obvious weakness.
If an adjacent sector has a visually distinct region which does not
overlap the central sector sufficiently, it is quite possible that the
cluster will be undetected in the central sector. The small region
representing the intrusion into the central sector will then be lost. The
Kohler algorithm improves the clustering step by adding candidate peaks
from surrounding sectors to the set of peaks selected for the central
sector. The augmented set of peaks forms the basis of the labeling
process. The algorithm can be summarized in five steps:

1)  Subdivide  the  image  into  sectors  and  select  cluster  labels  in 
each  expanded  sector. 

2)  Analyze  cluster  labels  from  adjacent  sectors  for  augmenting  the 
label  set. 

3)  Use  the  expanded  set  of  labels  to  segment  each  sector. 

4)  Remove  sector  boundaries  by  merging  similar  regions. 

5) Perform small region suppression (which often reduces the number
of regions by a factor of 4).
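
Steps 1 and 2 above can be sketched roughly as follows. This is an
illustrative simplification, not the Nagin-Kohler implementation: it
assumes that "overlap by 25%" means growing each sector by N/4 pixels on
every side, that clusters are local maxima of a coarse one-dimensional
intensity histogram, and that a neighbouring peak is added only when it
lies at least a fixed distance from every peak already present. All
function names and default parameters are hypothetical.

```python
def expanded_sector(row, col, n, height, width, overlap=0.25):
    """Pixel bounds (top, left, bottom, right) of sector (row, col),
    grown by overlap * n pixels on each side and clipped to the image."""
    pad = int(n * overlap)
    top = max(0, row * n - pad)
    left = max(0, col * n - pad)
    bottom = min(height, (row + 1) * n + pad)
    right = min(width, (col + 1) * n + pad)
    return top, left, bottom, right


def histogram_peaks(values, n_bins=16, max_value=256):
    """Bin centers of the local maxima of a coarse intensity histogram."""
    hist = [0] * n_bins
    for v in values:
        hist[min(n_bins - 1, int(v * n_bins / max_value))] += 1
    peaks = []
    for b, count in enumerate(hist):
        left = hist[b - 1] if b > 0 else 0
        right = hist[b + 1] if b < n_bins - 1 else 0
        if count > 0 and count >= left and count > right:
            peaks.append((b + 0.5) * max_value / n_bins)
    return peaks


def augment_peaks(central, neighbours, min_separation=16.0):
    """Add peaks from adjacent sectors that are not already represented."""
    result = list(central)
    for p in neighbours:
        if all(abs(p - q) >= min_separation for q in result):
            result.append(p)
    return sorted(result)
```

Each pixel of the expanded sector would then be labeled with its nearest
peak in the augmented set before connected regions are formed.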

Figure  3.1  shows  the  result  of  applying  the  first  3  steps  above  to 
data  at  level  8  in  the  processing  cone.  The  level  8  data  was  obtained 
from  the  level  12  data  by  averaging  over  2x2  windows  from  level  to 
level.  Figure  3.2  is  the  segmentation  after  elimination  of  sector  bound¬ 
aries  and  small  region  suppression. 
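
The 2x2 averaging that produces each coarser level can be sketched as
below. This is a minimal sketch that assumes an image stored as a list of
equal-length rows with even dimensions; the function names are
hypothetical. It follows the level numbering used in this paper, in which
level k holds a 2^k x 2^k image, so four reductions take the 4096 x 4096
level 12 data to the 256 x 256 level 8 data.

```python
def reduce_2x2(image):
    """Average non-overlapping 2x2 windows, halving each dimension."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image dimensions must be even")
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def build_cone(image, levels_up):
    """Return [finest, ..., coarsest] after `levels_up` reductions."""
    cone = [image]
    for _ in range(levels_up):
        cone.append(reduce_2x2(cone[-1]))
    return cone
```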
III.2. Hierarchical Focus of Attention

The region segmentation algorithm can be applied to any subset of the
set of available sectors. Starting at a coarse resolution image (level i),
regions are selected which potentially correspond to a specific object
modeled in the knowledge base. The sectors intersecting these regions
are determined and used to select the corresponding set of sectors at
the next finer level of resolution (level i + 1). The algorithm is applied
again but only within this new subset.

Thus the processing methodology we are employing can be summarized
as repeated execution of:

1.  Segment  the  image  at  some  level  i; 

2.  Select  a  sub-collection  of  the  regions  satisfying  some  object 
model; 

3.  Select  those  sectors  which  have  a  non-empty  intersection  with 
one  of  the  selected  regions; 

4. Project those sectors to level i + 1 obtaining a collection of
sectors at level i + 1;

5. Re-segment only within the sectors obtained in step 4.
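
Steps 3 and 4 can be illustrated with the short sketch below. It assumes
that sectors are square blocks with the same pixel size at every level
and are indexed by (row, column), so that one sector at level i covers
four sectors at level i + 1. The overlap between sectors is ignored here,
and the function names are hypothetical.

```python
def sectors_hit(region_pixels, sector_size):
    """Step 3: indices of sectors containing any selected region pixel.

    region_pixels is an iterable of (row, col) pixel coordinates.
    """
    return {(r // sector_size, c // sector_size) for r, c in region_pixels}


def project_sectors(sectors):
    """Step 4: the sectors at level i + 1 covering the same image area."""
    return {
        (2 * sr + dr, 2 * sc + dc)
        for sr, sc in sectors
        for dr in (0, 1)
        for dc in (0, 1)
    }
```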

For large imagery of the type considered in this paper, the
computational advantage of this approach is significant. For example if
we assume that even 2/3 of the possible sectors are used at each level,
then only 1/5 of the image is being examined 4 levels down. In the case
that only 1/4 of the sectors are selected, only 1/250th of the image is
being searched 4 levels down. In addition, the knowledge base can
be structured to be level dependent such that at finer levels of
resolution the description of objects, and hence the inferencing process,
will be more complicated, while at the same time the system is looking at
only a small fraction of the image. With these types of strategies the
computational complexity of the process can be kept within reasonable
bounds.
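
The fractions quoted above follow from raising the per-level selection
fraction to the number of levels descended, assuming the same fraction of
sectors is kept at every level. The helper name below is hypothetical.

```python
from fractions import Fraction


def fraction_examined(selected_fraction, levels_down):
    """Fraction of the image still under examination after descending
    `levels_down` levels, keeping `selected_fraction` at each level."""
    return Fraction(selected_fraction) ** levels_down


two_thirds = fraction_examined(Fraction(2, 3), 4)  # 16/81, about 1/5
one_quarter = fraction_examined(Fraction(1, 4), 4)  # 1/256, about 1/250
```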

III.3.  Finding  Buildings 

In the aerial image of Figure 1.1, many of the large buildings appear
as small, bright regions at coarse levels of resolution. Although such a
simple selection rule allows the system to hypothesize likely locations
of buildings, the evidence is too unreliable to permit actual labeling of
individual buildings. The coarsest level where reliable shape features
can be extracted is level 10 (1024 x 1024), although to exhaustively
search level 10 for buildings would be very expensive. Rather, the
system starts building candidate regions at a coarser level (here, level
8) and builds a mask which specifies likely areas. The mask is refined
level by level until more expensive features can be extracted at level
10, but now focussing only on the likely areas.

Using the level 8 Nagin-Kohler segmentation (Figure 3.2), size and
brightness features were calculated for each region. Figure 3.3 shows
regions whose average brightness is greater than 180, regions whose size
is less than 70 pixels, and the intersection of both sets of regions. By
these criteria, any sector intersecting a bright, small region was marked.
The resulting mask is shown in Figure 3.4. Using the pyramid structure
to project the mask, the segmentation algorithm was applied under this
mask at level 9 (Figure 3.5). Note that the intensity range in the rule
at one level could be different from that at other levels, and that the
size range has a natural scaling to the lower levels. The features of
intensity and size were calculated for this new segmentation, and the
upper left hand quadrant is shown in Figure 3.6. A new mask is created
by selecting a sector if it intersects a region satisfying a size and
intensity constraint (Figure 3.7). This mask is projected to level 10 and
the Nagin-Kohler algorithm is applied. Again, regions are selected on the
basis of size and intensity criteria and the result is displayed in
Figure 3.8.
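
The level 8 selection rule can be sketched as follows, assuming the
segmentation is available as a label image aligned with the intensity
image (both as lists of rows). The thresholds 180 and 70 are those quoted
above; the function names and the factor-of-four scaling of the size
threshold per finer level (each pixel becomes a 2x2 block) are
illustrative assumptions.

```python
def region_features(labels, image):
    """Return {label: (size_in_pixels, mean_brightness)}."""
    sizes, sums = {}, {}
    for label_row, pixel_row in zip(labels, image):
        for label, value in zip(label_row, pixel_row):
            sizes[label] = sizes.get(label, 0) + 1
            sums[label] = sums.get(label, 0) + value
    return {lab: (sizes[lab], sums[lab] / sizes[lab]) for lab in sizes}


def building_candidates(features, min_brightness=180, max_size=70):
    """Labels of regions that are both bright and small."""
    return {
        label
        for label, (size, brightness) in features.items()
        if brightness > min_brightness and size < max_size
    }


def scaled_max_size(max_size, levels_finer):
    """Size threshold rescaled for a level `levels_finer` steps finer."""
    return max_size * 4 ** levels_finer
```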


III.4.  Linear  Feature  Extraction 

Intensity changes in images are due in part to reflectance, depth,
orientation and illumination discontinuities and the interpretation of
such changes is fundamental to computer vision. The organization of
significant local intensity changes - called 'edges' - into more global
coherent events - called 'lines' or 'boundaries' - is an early but
important step in the transformation of the visual signal into useful
intermediate constructs for interpretation processes. Despite the vast
amount of research appearing in the literature, even the extraction of
linear boundaries has remained a difficult problem in many image
domains. There are two goals of the approach presented here: a) the
development of mechanisms for extracting linear features from complex
images, including low contrast lines; and b) the construction of
an intermediate representation of edge/line information which allows
high-level mechanisms efficient access to relevant image events. A more
detailed presentation can be found in [BUR84].

The  linear  feature  algorithm  is  based  on  the  contention  that  edge 
magnitude  (local  intensity  change)  is  inherently  unreliable  as  a  local 
estimate  of  the  meaningfulness  of  edge  events.  Some  of  the  problems 
with  using  edge  magnitude  include  those  caused  by: 

a) discrete sampling - sharp intensity changes involve 'mixed pixels'
which result in the full edge contrast being distributed
across several pixels;

b) wide gradients - total edge contrast may be distributed across
many pixels so that no optimum edge operator size can be fixed
a priori (and there are no uniformly acceptable mechanisms for
combining responses from different operator sizes and placements);

c)  world  events  -  a  change  in  background  or  foreground  objects 
along  an  edge  may  cause  the  local  magnitude  estimates  and 
total  edge  contrast  to  vary  in  arbitrary  ways. 

If the entire set of pixels associated with a particular linear feature
were determined, then the decision about the line could be made
globally as opposed to locally; in purely local decisions, weak or noisy
edge information could confuse the process which locates the edge. While
we cannot provide detailed discussions of the issues here, the use of
gradient orientation as an early organizing criterion appears to be
effective in overcoming these problems.

The general approach to extracting linear image events is to group
the pixels into 'edge support' regions of similar gradient orientation,
and then to treat each region as a representation of a straight line
segment. Note that every intensity variation (including very low
magnitude changes) will initially be extracted as a line segment. During
the interpretation of these events, adjacent low contrast support
regions can be grouped into homogeneous regions and filtered so that
they are not viewed as weak linear events.

There  are  four  basic  steps  in  extracting  linear  image  events: 

a)  Group  Pixels  into  Edge-Support  Regions.  Sets  of  pixels  are 
grouped  based  upon  similarity  of  gradient  orientation.  This  al¬ 
lows  data  directed  organization  of  edge  contexts  without  com¬ 
mitment  to  masks  of  a  particular  size. 

b) Approximate the Intensity Surface by a Weighted Planar Fit.
A plane is fit to the intensity surface of the set of pixels in each
support region. The fit is weighted by the gradient magnitude
associated with the pixels so that the important intensity
changes will dominate.

c) Extract Attributes from the Edge-Support Context. A straight
line and a set of features is then extracted from the support
region and the planar fit of its associated intensity surface.
The attributes extracted include length, contrast, sharpness
(width), location, orientation, and straightness.

d) Filter to Extract Straight Lines and Texture. Various image
events, such as long lines, high contrast short lines, low contrast
short lines, and lines at particular orientations, can then be
extracted by filtering on the desired attributes.
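
Step a) can be illustrated with a deliberately simplified sketch. It
assumes 2x2 difference masks for the gradient, a single fixed partition
of the orientation range into equal buckets, and 4-connected grouping of
pixels that fall in the same bucket. The actual grouping procedure is the
one described in [BUR84]; the function names and the bucket count below
are hypothetical.

```python
import math


def orientation_buckets(image, n_buckets=8):
    """Quantized gradient orientation for each 2x2 neighborhood;
    None where the gradient is zero."""
    width = 2 * math.pi / n_buckets
    buckets = []
    for r in range(len(image) - 1):
        row = []
        for c in range(len(image[0]) - 1):
            gx = (image[r][c + 1] + image[r + 1][c + 1]) - (image[r][c] + image[r + 1][c])
            gy = (image[r + 1][c] + image[r + 1][c + 1]) - (image[r][c] + image[r][c + 1])
            if gx == 0 and gy == 0:
                row.append(None)
            else:
                angle = math.atan2(gy, gx) % (2 * math.pi)
                row.append(int(angle // width) % n_buckets)
        buckets.append(row)
    return buckets


def support_regions(buckets):
    """4-connected components of equal orientation bucket."""
    rows, cols = len(buckets), len(buckets[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or buckets[r][c] is None:
                continue
            target, stack, region = buckets[r][c], [(r, c)], set()
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen and buckets[ny][nx] == target):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append(region)
    return regions
```

On a vertical step edge this yields one support region running along the
edge; the planar fit and attribute extraction of steps b) and c) would
then operate on each such region.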

An example of this algorithm applied to the level 8 image (256 x
256) with all lines of length greater than 2 pixels (and no filtering for
contrast) is shown in Figure 3.9.

III.5. Combining the representations

The region and edge segmentations complement each other in several
ways. The region segmentation algorithm segments an image on the
basis of homogeneity in intensity, color, or texture features. In this
paper we are only using intensity for the region segmentation, thereby
producing what will be called intensity regions. The edge algorithm
defines regions, usually several pixels wide, which support edges. The
most important point, however, is that each of the two segmentation
processes results in a pixel-based representation. The natural duality
between regions and their boundary lines can be exploited in a
straightforward manner, since the edge-support region associated with a
linear feature should overlap the regions on either side of the region
boundary. This means that a set of edge support regions can be expected
to be found superimposed on the boundary of an intensity region.
Conversely, a set of adjacent intensity regions can be associated with an
edge support region of a linear feature. This grouping of regions in one
segmentation based on information contained in the other is performed
by tracing the boundary of a selected region in the first while
collecting the regions encountered in the second. In a parallel
architecture such as the processing cone the set of related regions and
lines can be collected in parallel.

There are several other advantages to using the combined line and
region information. The lines produced by edge support regions tend to
be much more accurate than the boundaries of the region segmentation
due to the global nature of determining the line placement; e.g., this
overcomes local problems of mixed pixels. Thus the limits of regions
can be determined more accurately. The overlap of short high-contrast
lines with regions also provides a texture attribute for regions. Finally,
a set of edges can be associated with a region and descriptors pertinent
to shape can be extracted from these lines. These observations led to
improved algorithms for the detection of buildings.

Let us consider one algorithm for verifying buildings based upon
the line and region information. One such algorithm traces boundaries
of selected intensity regions to gain from the corresponding edge
segmentation the collection of edges surrounding each region. Lines
are selected which lie on the boundaries of regions hypothesized to be
buildings based upon size and intensity information. Results of the
linear feature algorithm on a subimage for which the building hypothesis
is to be verified are shown in Figure 3.10.

Rectangularity is a property which can be expected of many buildings
and several measures can be applied to test the set of lines selected
for this and a variety of other geometric properties. For example, the
orientation of the lines can be histogrammed as a weighted contribution
of the length of the lines. One simple measure is the degree to
which the lines group into two colinear sets at each of two orientations
that are 90 degrees apart. Figure 3.11 highlights regions whose average
orientations between the two clusters is within 0.15 radians of a right
angle. The labelling of some buildings provides a context for finding
others. It also reduces the search space for the processing of algorithms
which do require the finest level of resolution such as measuring height
to the best possible accuracy in stereo pairs and determining height
from shadow length.
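
A length-weighted orientation test of this kind might be sketched as
follows. This is an illustrative approximation rather than the measure
actually used: it assumes each boundary line is given as a pair
(orientation in radians, length), takes the two heaviest non-adjacent
histogram bins as the two clusters, and checks only the angle between the
cluster means, not colinearity. The bin count and helper names are
hypothetical; the 0.15 radian tolerance is the one quoted above.

```python
import math


def _bin(theta, n_bins):
    return int((theta % math.pi) / math.pi * n_bins) % n_bins


def _circular_distance(a, b, n_bins):
    d = abs(a - b) % n_bins
    return min(d, n_bins - d)


def _mean_orientation(lines):
    # Orientations repeat every pi, so average on doubled angles.
    s = sum(length * math.sin(2 * theta) for theta, length in lines)
    c = sum(length * math.cos(2 * theta) for theta, length in lines)
    return 0.5 * math.atan2(s, c)


def is_rectangular(lines, n_bins=18, tolerance=0.15):
    """True if the two dominant orientation clusters are within
    `tolerance` radians of perpendicular."""
    hist = [0.0] * n_bins
    for theta, length in lines:
        hist[_bin(theta, n_bins)] += length
    first = max(range(n_bins), key=lambda b: hist[b])
    rest = [b for b in range(n_bins)
            if hist[b] > 0 and _circular_distance(b, first, n_bins) > 1]
    if not rest:
        return False
    second = max(rest, key=lambda b: hist[b])
    means = []
    for peak in (first, second):
        members = [(t, l) for t, l in lines
                   if _circular_distance(_bin(t, n_bins), peak, n_bins) <= 1]
        means.append(_mean_orientation(members))
    d = abs(means[0] - means[1]) % math.pi
    d = min(d, math.pi - d)
    return abs(d - math.pi / 2) <= tolerance
```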

III.6. Finding Runways

Runways are an example of a large, clear, linear structure whose
location may be hypothesized at a very coarse resolution. The identifying
feature is a set of long lines appearing close together with similar
slope. The location of such a group provides the system with a starting
point, but does not locate the full extent of the runway since fragmented
sections of lines at the ends of the runway may fall below length
thresholds. If, on the other hand, intensity regions could be identified,
the homogeneity of brightness would more reliably give us the full extent
of the runway. The idea of the method which follows is to trace the
boundaries of the selected edges to identify the pertinent intensity
regions. These regions can be used to build a mask to direct further
processing.

Figure 3.12 shows the image reduced to level 7 using 2x2 averaging
up the cone. The line algorithm is applied and the results of filtering
the edges on contrast and length are shown in Figures 3.13 and 3.14.

The edge representation includes the position ρ and orientation θ
of each line. This facilitates the grouping of lines via clusters
in Hough space. The lines remaining after the filtering shown in Figure
3.14 were mapped to Hough space. The largest peak in this histogram
was selected and the edge support regions of that cluster were
identified. These selected edge support regions, projected from level 7
to level 8, are highlighted in Figure 3.15. The adjacent regions obtained
from the region segmentation applied to the intensity image are
highlighted in Figure 3.16. We now apply the focus of attention mechanism
to the regions, selecting the sectors shown in Figure 3.17. These sectors
are projected to level 9 and the region algorithm is applied within these
sectors. The result is shown in Figure 3.18.
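
The Hough-space grouping step can be sketched as a two-dimensional
histogram over line position and orientation. The sketch assumes each
filtered line carries an identifier, a position rho (in pixels), and an
orientation theta (in radians). The bin widths are hypothetical, and each
line counts as one vote; weighting by length would be an equally
plausible choice.

```python
import math
from collections import defaultdict


def largest_hough_cluster(lines, rho_bin=8.0, theta_bin=math.radians(10)):
    """Return the identifiers of the lines in the fullest (rho, theta) cell.

    lines is an iterable of (line_id, rho, theta) tuples.
    """
    cells = defaultdict(list)
    for line_id, rho, theta in lines:
        key = (math.floor(rho / rho_bin),
               math.floor((theta % math.pi) / theta_bin))
        cells[key].append(line_id)
    if not cells:
        return []
    return max(cells.values(), key=len)
```

A fixed grid can split a true cluster across a cell boundary, so a
practical version would also examine neighbouring cells before selecting
the peak.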

IV. Conclusions and Future Directions

We have proposed an architecture for interpreting large images
based on hierarchical segmentation and interpretation processes
functioning under focus of attention mechanisms. The experiments
discussed in Section III demonstrate the potential effectiveness of
integrating two particular complementary segmentation algorithms into a
multi-resolution structure. They have shown how a range of object
hypothesis rules can be used to guide and constrain the process of
locating instances of the objects. The rules themselves are hierarchical
and the strategies for applying them are a function of the features which
can be reliably extracted at the various image resolutions.

The selection of candidate regions for examination at a higher
resolution was accomplished by choosing all regions which satisfied a set
of object dependent constraints on region and line attributes. In
general, the results from such a simple rule will be unreliable and prone
to error. Image interpretation is implicitly involved with the problems
associated with combining information from multiple sources of
knowledge. Any perceptual system which utilizes processed sensory data
must recognize the fact that to varying degrees the information will
be imperfect and prone to error. With this in mind we are developing
mechanisms for evidential reasoning and inferencing under uncertainty
[LOW82, WES82] in order to construct more reliable focussing sets.

Some of the limitations of inferencing using Bayesian probability
models are overcome using the Dempster-Shafer formalism for evidential
reasoning, in which an explicit representation of partial ignorance
is provided [SHA76]. The inferencing model allows "belief" or
"confidence" in a proposition to be represented as a range within the
[0,1] interval. The lower and upper bounds represent support and
plausibility, respectively, of a proposition, while the width of the
interval can be interpreted as ignorance. Wesley [WES82] is extending
this approach to the problem of distributed control of a set of knowledge
sources which can be applied to examine particular concepts in long-term
memory.
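
For reference, the support and plausibility bounds and Dempster's rule
for combining two independent bodies of evidence can be written in a few
lines. The sketch below represents a mass assignment as a dictionary from
frozensets of hypotheses to masses; this representation and the function
names are illustrative choices and say nothing about how the inference
network described here is implemented.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass assignments."""
    combined, conflict = {}, 0.0
    for a, x in m1.items():
        for b, y in m2.items():
            both = a & b
            if both:
                combined[both] = combined.get(both, 0.0) + x * y
            else:
                conflict += x * y  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("the two bodies of evidence are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


def support(m, proposition):
    """Lower bound: total mass committed to subsets of the proposition."""
    return sum(v for k, v in m.items() if k <= proposition)


def plausibility(m, proposition):
    """Upper bound: total mass not contradicting the proposition."""
    return sum(v for k, v in m.items() if k & proposition)
```

With two rules each committing part of their mass to "building" and
leaving the remainder on the whole frame, the combined interval for
"building" narrows as evidence accumulates; the remaining width is the
ignorance referred to above.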


The object hypothesis rules described in the last section can be
applied to both region and line attributes provided by our intermediate
level of representation. A set of KSs associated with objects can
be constructed so that when executed they will input evidence to the
knowledge network as support or refutation of propositions. The
inference engine will then be turned on to indirectly update belief in
other propositions based upon the implications of the direct evidence.

Once the inference network is fully integrated, we expect the
hierarchical segmentation and interpretation process to operate as
follows. First the local histogram-based region segmentation and the
linear feature extraction algorithm are applied at a coarse level of
resolution. Knowledge sources in the form of object hypothesis rules are
then applied to region and line attributes at that level. The output of
these rules is converted to a form appropriate for input into the
inference network of long-term memory. The inferencing process is then
invoked and each region yields a support and plausibility (i.e., a range
of belief) that it is a candidate region for one of the goal objects. The
region and line segmentation algorithms are then applied at a finer level
of resolution, but only on the candidate regions which have high support.
At this finer level of resolution the representation of the object is of
a different form and may involve more expensive object rule combinations
of the region and line attributes, but applied only to a small subset of
the image. The process is recursively applied to finer levels of
resolution. Other ongoing research uses the inference net to focus the
system within the various feature spaces that are active and that work
will also be reported in the future.
Figure 1.1.

4096  x  4096  image  reduced  to  level  9  (512  x  512). 
Figure 3.1.

Segmentation of each sector using the expanded set of labels.


Figure 3.2.

Segmentation after removing sector boundaries and elimination of small
regions.
Figure 3.3.

Level 8 segmentation with small regions filled with \\\, bright regions
filled with ///. Thus bright small regions are displayed as cross-hatched
areas.


Figure  3.4. 

The darkened area shows those sectors which intersect small, bright
regions.


Figure  3.5. 

Segmentation  at  level  9  under  the  mask  created  at  level  8. 


Figure  3.6. 

One quadrant of the segmentation at level 9, under the mask generated
at level 8. Regions which satisfy a size and intensity constraint are
shown as cross-hatched areas.


Figure  3.9. 

Edges  obtained  at  level  8  from  the  application  of  the  Burns  algorithm. 
All  lines  of  length  greater  than  2  pixels  are  shown. 


Figure  3.10. 

Edges  at  level  10  of  length  greater  than 
than  12. 
Figure 3.11.

A region is highlighted if the difference between the orientation of the
two largest adjacent edge groups is within .15 radian of a right angle.


Figure 3.12.

The level 7 image.

Figure  3.13. 

Edges  of  the  level  7  image  whose  length  is  greater  than  1  pixel. 


Figure  3.14. 

Edges of the level 7 image of length greater than 9 pixels and contrast
greater than 12.
ABSTRACT

Reformulation is a powerful problem solving technique
that has received little attention to date. In contrast to
previous problem solving techniques, reformulation requires
reasoning about a problem formulation rather than within a
problem formulation. The research reported in this paper
focuses upon two reformulation methods, schema driven
reformulation and constraint based reformulation. In schema
driven reformulation the original problem formulation is
transformed into the terms and vocabulary of a general
problem solving method, such as divide and conquer. In
contrast, constraint based reformulation focuses upon
discovering those properties of a problem that can be exploited by
problem solving methods. Schema driven and constraint
based reformulation are complementary methods that
respectively use the structure and justifications of problem solving
methods.

I.  INTRODUCTION 

This research evolved from a project to automatically
choose good representations for problems in vision and
robotics. The key insight is that good representations make
explicit the problem properties that can be exploited by
problem solving methods. For example, choosing a joint centered
co-ordinate system as a representation for some robotic
problems separates out invariants and simplifies the
transformations involved in joint motions. Good vision research is also
characterised by finding properties of the image formation
process and devising algorithms that exploit these properties.
For example, in the blocks world only a small number of the
combinatorially possible junction types are physically
realisable. This property is exploited in the Waltz algorithm.

Current standard AI problem solving techniques are not
capable of solving problems through reformulation. Problem
reformulation requires that the initial description be
transformed into the terms and vocabulary of an applicable
problem solving method. Weak methods, such as generate and
test, can often be applied to the problem as stated, whereas
strong methods such as dynamic programming can require
substantial reformulation. The efficiency of strong methods
is derived from their exploitation of problem properties which
are only implicit in the initial problem description. The goal
of problem reformulation is to discover these implicit
problem properties and transform the problem description into
terms that make them explicit and operational within the
context of a problem solving method.


Saul Amarel has written several papers on problem
reformulation, including detailed accounts of the steps needed
in reformulating the missionaries and cannibals puzzle
[Amarel 68] and the Tower of Hanoi [Amarel 82]. He
emphasized the need for discovery of useful problem properties.
Ernst and Goldstein have written a special purpose
program to transform a problem statement into the parameters
for a GPS algorithm through the discovery of operator
invariants [Ernst 82]. Similarly, Douglas Smith has worked
on the mechanical transformation of problems into Divide
and Conquer algorithms [Smith 82]. Korf wrote an
article on a general framework for reformulation focusing
upon transformation rules that implement homomorphisms
and isomorphisms [Korf 80]. He also wrote a special
purpose system that transformed a problem statement into
the parameters of a specialized extension of GPS called
macro problem solving. This requires the discovery of
operator decomposability, which is a problem property
related to the invariants used in GPS algorithms [Korf 83].
Jack Mostow and Richard Keller have worked on a special
type of reformulation called operationalization. The goal of
operationalization is to transform a problem statement in a
language that cannot be directly executed into the terms of
a language that can be directly executed. In their systems,
operationalization is achieved through a sequence of rewrite
rules. Mostow has implemented a partially automatic
system and demonstrated it on operationalizing advice for the
card game hearts [Mostow 83]. Richard Keller presented a
paper on finding heuristics for symbolic integration through
operationalization [Keller 83].

The focus of the research reported in this paper
is reformulation through discovery of exploitable problem
constraints and instantiation of general problem solving
methods. The first I denote by constraint driven
reformulation, the second by schema driven reformulation. The
mapping from the original problem description to a reformulated
description is a by-product of these two interacting processes.
The criterion for evaluating a reformulation is the
computational complexity of the resulting algorithm. In contrast
to the work cited above, the search for problem properties
is not restricted to one particular type of property such as
operator invariants, or one particular type of algorithm such
as GPS. In addition, the criterion for evaluating a
reformulation, which in this work is computational complexity, is used
explicitly to constrain the search for useful properties.

The balance of this section gives an overview of schema
driven reformulation and constraint based reformulation.
Section II describes schema driven reformulation through
an example from computational geometry, and Section III
describes constraint driven reformulation. Section IV gives
the algorithms for these two reformulation methods.

A schema is an abstract plan for a class of algorithms;
examples are:

1. Divide and Conquer

2. Dynamic Programming

3. Iterative Relaxation

A schema specifies the logical and computational
relationships between its subparts. The logical relationships
between subparts form the justification of the schema, i.e.
why it works. When instantiated to a particular algorithm
the justification is a proof of program correctness. The
computational relationships of control flow and data flow between
subparts will be referred to as the structure of a schema.

Schema driven reformulation proceeds by choosing a
schema for instantiation and then refining its subparts in
order to solve a given problem. For example, after choosing
divide and conquer the split step and the merge step would be
refined. A well chosen schema can break a difficult problem
into more manageable subproblems, i.e. the refinement of
the subparts of the schema.

The second general method is to discover the properties
of the problem first, and then choose and instantiate a
schema based upon this information. The difficulty is in
constraining the search space of useful properties. The search
space of properties is defined by two lattices in a space
of semantic prototypes. A semantic prototype is a set of
abstract objects and a set of constraints that hold between
the abstract objects. A semi-group is a semantic prototype;
in fact most of the semantic prototypes discussed in this
paper are algebraic structures. Semantic prototypes have
also been referred to as abstractions [Genesereth 80], semantic
cliches [Chapman 83], and the generic term concepts [Lenat
83].

The first lattice in the space of semantic prototypes is
formed by the relation specialization. A specialization
imposes additional constraints on a semantic prototype. For
example an equivalence relation is a specialization of a
binary relation. The additional constraint is that the relation
be reflexive, symmetric, and transitive. The second lattice is
formed by the relation extension. A semantic prototype is
an extension of another if it includes additional abstract
objects. For example a monoid is an extension of a semi-group,
because it includes an additional element, the identity
element. Semantic prototypes can also be linked by mappings,
which specify how to transform one semantic prototype into
another. For example a mapping specifies how to transform
a total order into a sequence.
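The two link types described above can be pictured with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class name `Prototype`, its fields, and the string encoding of constraints are hypothetical and are not taken from the system described in this paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Prototype:
    """Hypothetical record for a semantic prototype: abstract objects plus constraints."""
    name: str
    objects: List[str]
    constraints: List[str]
    specializations: List["Prototype"] = field(default_factory=list)
    extensions: List["Prototype"] = field(default_factory=list)

    def specialize(self, name: str, extra_constraints: List[str]) -> "Prototype":
        # A specialization keeps the same objects and adds constraints.
        child = Prototype(name, list(self.objects),
                          self.constraints + list(extra_constraints))
        self.specializations.append(child)
        return child

    def extend(self, name: str, extra_objects: List[str],
               extra_constraints: List[str] = ()) -> "Prototype":
        # An extension adds abstract objects (and any constraints that mention them).
        child = Prototype(name, self.objects + list(extra_objects),
                          self.constraints + list(extra_constraints))
        self.extensions.append(child)
        return child


# The two examples used in the text.
binary_relation = Prototype("binary relation", ["R"], [])
equivalence = binary_relation.specialize(
    "equivalence relation", ["reflexive", "symmetric", "transitive"])
semigroup = Prototype("semi-group", ["carrier set", "operation"], ["associative"])
monoid = semigroup.extend("monoid", ["identity element"], ["identity law"])
```

Keeping the two kinds of link in separate lists mirrors the statement that specialization and extension each form their own lattice over the same prototypes.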

The search for useful properties is guided by this
lattice of semantic prototypes and the criterion of
computational complexity. Briefly, the first step is to characterize
the objects specified in the problem statement in terms of the
semantic prototypes. For example, if a binary relation is part
of the problem specification, it is characterized in terms of
semantic prototypes such as partial orders, equivalence
relations, and total orders. Then algorithms associated with the
semantic prototypes are examined to determine whether they
meet the criteria specified for computational complexity. If
not, then the search for problem properties within the lattice
continues until algorithms that meet the specified criteria of
computational complexity are found.


Candidate properties are first examined empirically on
several examples to determine if they are plausible. For
example, the first step in determining whether a binary relation
is symmetric is to generate some examples of the relation and
then determine whether the examples are symmetric. This is
similar to Gelernter's use of models in his geometry theorem
prover. If a property is empirically plausible, then an attempt
is made to prove that it holds. General purpose theorem
provers are too inefficient (e.g. causing a Symbolics 3600 Lisp
machine to run out of memory). Special purpose theorem
provers will be used instead. The details have not been
fully worked out, but promising approaches include:

1. Theorem proving schemas such as induction, which
are instantiated much like algorithm schemas.

2. Specialized methods for subclasses of theorems. For
example, algebraic manipulation is a specialized
method for proving algebraic equalities.

II.  SCHEMA  DRIVEN  REFORMULATION 

The strategy of schema driven reformulation is to pick a
problem solving schema and perform a partial mapping from
the problem to the schema. The partial mapping instantiates
some of the operators and some of the justifications for steps
in the schema. The partially instantiated operators and
partially instantiated justifications are then refined through
deduction and analysis of examples until a fully instantiated
algorithm is derived from the schema.

This  strategy  will  be  illustrated  with  an  example  from 
computational  geometry: 

Given  a  set  of  points  on  the  plane,  the  problem  is  to  find 
those  points  that  have  no  other  points  in  their  upper  right 
hand quadrant. (See Figure 1.) More formally, the maximal
point  problem  is  formulated  as  a  function  from  a  set  of  points 
S  on  a  plane  with  an  x-y  co-ordinate  system  to  a  subset  T  of 
these  points.  The  condition  is  that  for  each  member  t  of  the 
subset  T,  no  other  member  of  S  is  in  the  upper  right  hand 
quadrant  of  t.  The  objective  is  to  find  an  N  log  N  algorithm, 
where  N  is  the  size  of  the  set  S. 

$$\{\, t \in S \mid \forall s \in S\ [\, x(t) > x(s) \lor y(t) > y(s) \lor t = s \,] \,\}$$
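Read literally, the specification above already defines a procedure: test every point t against every point s. The Python sketch below is a direct transcription offered only as a baseline; the function name `maximal_points_bruteforce` and the tuple representation of points are assumptions made for illustration. It runs in quadratic time, which is exactly what the reformulation is meant to improve on.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y); an assumed representation


def maximal_points_bruteforce(points: List[Point]) -> List[Point]:
    """Return each t such that, for every s, x(t) > x(s) or y(t) > y(s) or t == s."""
    result = []
    for t in points:
        # The characteristic function of the specification, checked against all s.
        if all(t[0] > s[0] or t[1] > s[1] or t == s for s in points):
            result.append(t)
    return result


sample = [(1, 5), (2, 3), (3, 4), (4, 1), (0, 0)]
maxima = maximal_points_bruteforce(sample)
```

For this sample the points (2, 3) and (0, 0) are rejected because (3, 4) lies in their upper right hand quadrant.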

The initial mapping simply specifies that a set labeled
S is the input, that a subset T is the output, and that the
input/output assertion specified above holds between S and
T. Since the recursive calls are instances of the divide and
conquer schema, similar labels and assertions are attached
to the input/outputs of the recursive calls. Thus S1 is the
input of the first recursive call and T1 is the output of the
first recursive call. Similarly S2 is the input of the second
recursive call and T2 is the output of the second recursive
call. The partially instantiated divide and conquer schema
is shown in Figure 2; the base case and issues of termination
will not be considered in order to simplify the example.

Below  is  a  synopsis  of  the  reformulation  of  this  problem 
into  the  terms  of  divide  and  conquer.  Control  of  the  search 
and  branches  that  ultimately  fail  will  not  be  considered, 
though  the  generation  of  each  move  in  the  reformulation  will 
be  elaborated. 
The specification of an N log N algorithm constrains the
split and merge steps to be linear time, and the outputs of
the split step to be balanced in size. This is an example of
how the criteria for judging the merits of a reformulation
can constrain the search. The logical constraint on the split
step is that it be a partitioning function for the input type,
in this case a set. Partitioning functions for a binary split
can be implemented using a characteristic function on a set
(mapping each element to 1 or 0). A heuristic is the
incorporation of a component of the problem specification into
the split step; possibly this will appropriately constrain the
remaining portions of the schema. One possibility is to
specialize the binary relation x(t) > x(s) to a unary function by
fixing s. The constraint of balanced outputs requires that s
be fixed to the point with median x co-ordinate. This requires
a preprocessing step of sorting the points by x co-ordinate,
which fits within the time bounds of N log N.

The next step is to propagate the assertions from this
heuristically chosen split step through the schema, in order
to constrain subsequent steps. Briefly, the highlights of this
propagation are:

1. Disjointness of subsets output from the partitioning

$$\forall s_1 \in S_1,\ s_2 \in S_2\ [\, x(s_1) < x(s_2) \,]$$

2. T2 is a subset of T, by definition with respect to
the partition S2 and because of being greater in x
co-ordinate for the partition S1

$$\forall t_2 \in T_2,\ s_2 \in S_2\ [\, x(s_2) < x(t_2) \lor y(s_2) < y(t_2) \lor t_2 = s_2 \,]$$

$$\forall t_2 \in T_2,\ s_1 \in S_1\ [\, x(s_1) < x(t_2) \,]$$

3. Part of T1 is a subset of T, namely those points with
y co-ordinate greater than y co-ordinate of all points
in S2

$$\forall t_1 \in T_1,\ s_1 \in S_1\ [\, x(s_1) < x(t_1) \lor y(s_1) < y(t_1) \lor t_1 = s_1 \,]$$

$$\forall t_1 \in T_1,\ s_2 \in S_2\ [\, y(s_2) < y(t_1) \rightarrow x(s_2) < x(t_1) \lor y(s_2) < y(t_1) \lor t_1 = s_2 \,]$$

These assertions constrain the merge step to be the
union of T2 and those elements of T1 which satisfy the third
assertion. Unfortunately, the running time for the
characteristic function denoted by this assertion is $n^2$, which
exceeds the linear time bound constraint. Constraint based
reformulation, discussed more fully in the next section,
determines that assertion 3 can be reformulated in terms of the
maximum of S2, with respect to the total order induced by
y co-ordinate. The cost is now linear for finding the subset
of T1 which satisfies assertion 3. The divide and conquer
schema is now fully instantiated as an N log N algorithm to
compute the maximal points.
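The fully instantiated schema can be sketched as follows. This Python version is an illustration rather than the system's own output: the function names, the tuple representation of points, the single-point base case (which the text deliberately leaves out), and the assumption that the input points are distinct are all choices made here. The split is at the median of the x-sorted points, and the merge keeps T2 together with those points of T1 whose y co-ordinate exceeds the maximum y co-ordinate in S2.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y); an assumed representation


def maxima_divide_and_conquer(points: List[Point]) -> List[Point]:
    """Maximal points of a set of distinct points, returned in ascending x order."""
    if not points:
        return []
    # Preprocessing step: one N log N sort by x co-ordinate (ties broken by y).
    return _solve(sorted(points))


def _solve(s: List[Point]) -> List[Point]:
    # Assumed base case: a single point is its own maximum.
    if len(s) == 1:
        return [s[0]]
    mid = len(s) // 2          # split at the median x co-ordinate
    s1, s2 = s[:mid], s[mid:]  # no point of s1 exceeds a point of s2 in x
    t1, t2 = _solve(s1), _solve(s2)
    # Merge step: assertion 3 reduced to one comparison against the maximum y of S2.
    y_max_s2 = max(y for _, y in s2)
    return [t for t in t1 if t[1] > y_max_s2] + t2


maxima = maxima_divide_and_conquer([(1, 5), (2, 3), (3, 4), (4, 1), (0, 0)])
```

Both the split (list slicing) and the merge (one maximum and one filter) are linear in the size of the subproblem, and the halves are balanced, which is what the N log N criterion demanded of them.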


• Denotes maximal point

Figure 2. Partially Instantiated Divide and Conquer Schema


III. CONSTRAINT BASED REFORMULATION

Through discovery of implicit constraints and properties
inherent in a problem it is often possible to derive
representations where the syntax matches the semantics of the
problem and the control structure exploits the constraints.
Mathematics, particularly abstract algebra, provides a rich
vocabulary for describing the structure of many
problem domains, in particular the domain of computational
geometry.

The first step in constraint based reformulation is to
characterize the problem, as given, in terms of basic
semantic prototypes. The prototypes often have strong problem
solving methods associated with them, reducing the problem
of reformulation to one of recognition. More importantly,
the recognition of semantic prototypes serves to guide the
search for properties that can be exploited by problem
solving methods.

Recognition of semantic prototypes proceeds both
analytically and empirically for all components of the
initial problem description. For the maximal point problem,
the generation and examination of several input/output pairs
determines that the output is not the null set, and usually
not a singleton set. If the output had always been a singleton
set, then the problem description could be reformulated as a
function from a set of points to a point.

The characteristic function for the maximal points is
analysed in terms of its components, which are a binary
relation and a quantified variable. The binary relation defined by
the disjunction within the characteristic function is analyzed
in terms of the basic algebraic properties of binary relations,
such as symmetry, transitivity, and functional dependence of
arguments. It is found that the relation is neither symmetric
nor anti-symmetric, and it is not transitive. Since a relation
can be defined in terms of its complement, the complement is
also characterized. The complement is found to be
antisymmetric and transitive, but not total, and therefore defines
a partial order. The problem can then be reformulated as
finding the minima of this partial order, or the maxima of
its dual.

Definition of partial order and the reformulated problem
description:

$$\mathrm{Partial}(s, t) \iff x(t) < x(s) \land y(t) < y(s) \land t \neq s$$

$$\{\, t \in S \mid \forall s \in S\ \neg\,\mathrm{Partial}(s, t) \,\}$$
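The empirical side of this characterization can be sketched by testing the algebraic properties on a finite sample of points. The Python code below is a minimal illustration, not the recognition procedure of the system described here; all function names are hypothetical, and the relation is taken to be the disjunction in the original characteristic function, with its complement obtained by plain negation. A property that fails on the sample is refuted; one that holds is only made plausible and would still have to be proved.

```python
from itertools import product


def rel(t, s):
    """The disjunction from the characteristic function."""
    return t[0] > s[0] or t[1] > s[1] or t == s


def comp(t, s):
    """Complement of rel: s is at least t in both co-ordinates and s != t."""
    return not rel(t, s)


def is_symmetric(r, sample):
    return all(r(b, a) for a, b in product(sample, repeat=2) if r(a, b))


def is_antisymmetric(r, sample):
    return all(a == b for a, b in product(sample, repeat=2) if r(a, b) and r(b, a))


def is_transitive(r, sample):
    return all(r(a, c) for a, b, c in product(sample, repeat=3) if r(a, b) and r(b, c))


def is_total(r, sample):
    return all(r(a, b) or r(b, a) for a, b in product(sample, repeat=2) if a != b)


# A small 3 x 3 grid of points serves as the example set.
grid = [(x, y) for x in range(3) for y in range(3)]
report = {
    "rel": (is_symmetric(rel, grid), is_antisymmetric(rel, grid), is_transitive(rel, grid)),
    "comp": (is_antisymmetric(comp, grid), is_transitive(comp, grid), is_total(comp, grid)),
}
```

On this sample the original relation fails all three tests, while its complement passes the antisymmetry and transitivity tests and fails totality, in agreement with the characterization given in the text.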


Points less in x co-ordinate can be ignored.


Algorithms associated with finding the maxima of a
partial order are $n^2$, so in this case reformulation through
recognition does not improve the computational complexity.
However, the recognition of the partial order serves as a
guide for exploring possibilities related to partial orders.
In particular, total orders are specializations of partial
orders whose associated algorithms have very good time
complexities. The goal becomes to reformulate the problem
description in terms of a total order. In order to be relevant,
the total order should be an extension of the partial order,
so that if two points are related in the partial order they are
necessarily related by the total order:

$$\mathrm{Partial}(s, t) \rightarrow \mathrm{Total}(s, t)$$

Two total orders are part of the conjunction defining the
partial order, namely ordering by x co-ordinate and ordering
by y co-ordinate. Since a conjunction always implies each of
its conjuncts, either of these total orders is a good candidate.
Ordering by x co-ordinate is chosen arbitrarily.

The goal is to recover the maxima of the partial order
using the total ordering by x co-ordinate, under the
constraint of N log N time. This goal is split into the subgoal
of finding necessary conditions on the maxima given the
total order, and the subgoal of finding sufficient conditions on
the maxima given the total order. The two subgoals are
explored in parallel until one succeeds or both fail. If necessary
conditions are found, then the subgoal is set up to find
complementary sufficient conditions in order to find an equivalent
condition. Similarly, if the subgoal for sufficient conditions
succeeds, then the subgoal is set up to find complementary
necessary conditions in order to find an equivalent condition.

To find necessary conditions empirically, examples of
maxima are generated and relations related to the total order
are determined. The empirical approach briefly described
below resembles that of Emde [Emde 83]. Random examples
of maximal points are generated, and semantic prototypes
linked by extension to total orders in the lattice are retrieved,
such as minimum, maximum, successor, and monotonicity,
which are then used to examine the examples with respect
to the ordering by x co-ordinate.


For points greater in x co-ordinate,
the monotonicity condition must hold.

Figure 4. Equivalent condition for a point T
to be a maximal point.
The necessary condition found is that the maxima are
monotonically increasing in y and decreasing in x. The next
step is to find a complementary sufficient condition, in order
to develop a reformulated problem description anchored in
the ordering by x co-ordinate. This is done by considering
sufficient conditions for a point t to be a maximum of the
partial order. The ordering by x co-ordinate partitions the rest
of the points into those greater than t, and those less than t.
For those less than t, the first clause in the disjunction
(x(s) < x(t)) satisfies the maximality condition for t. For those
points greater than t, the monotonicity condition is a sufficient
and necessary condition; that is, if t is greater in its y
co-ordinate than the next maximum in x co-ordinate, then t is
also a maximum. These two conditions provide an equivalent
condition to the original problem formulation; all that remains
is to instantiate a simple enumerate-test-accumulate schema to
derive an N log N algorithm. An N log N preprocessing
step is required to order the points by x co-ordinate. The
enumeration is by descending order of x co-ordinate, which
implicitly encodes the test for all points less in x co-ordinate.
The monotonicity condition instantiates the test operation of
the schema. It is constant time because only the previously
found maximum needs to be checked. The accumulation is
a simple set accumulation.
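The enumerate-test-accumulate instantiation described above can be sketched in a few lines. This Python version is an illustration under stated assumptions: the function name, the tuple representation of points, the tie-breaking by y within equal x, and the requirement that the input points be distinct are choices made here and are not specified in the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y); an assumed representation


def maxima_by_sweep(points: List[Point]) -> List[Point]:
    """Maximal points of a set of distinct points, in descending x order."""
    result: List[Point] = []
    last_max_y = None  # y co-ordinate of the previously found maximum
    # Enumerate: descending x (N log N sort), higher y first among equal x.
    for p in sorted(points, reverse=True):
        # Test: the monotonicity condition against the previous maximum only.
        if last_max_y is None or p[1] > last_max_y:
            # Accumulate: simple collection of the maxima found so far.
            result.append(p)
            last_max_y = p[1]
    return result


maxima = maxima_by_sweep([(1, 5), (2, 3), (3, 4), (4, 1), (0, 0)])
```

After the sort, each point costs one comparison, so the sort dominates the running time, consistent with the N log N bound.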

IV.  DESCRIPTION  OF  REFORMULATION  STRATEGIES 

A schema serves as an abstract plan for guiding the
search for an algorithm. The schema decomposes the
reformulation into mutually constraining subparts, defining
the relationship between the new terms of the problem
description. Schemas are implemented using the formalism
of the programmer's apprentice [Rich 81]. Like semantic
prototypes, schemas are organised in a dual hierarchy with
specialisation and extension links. As an example,
quicksort is a specialisation of Divide and Conquer. An extension
to a schema is an addition of new components, such as a
preprocessing step to sort the points by x co-ordinate. An
algorithm is a schema whose operators and test predicates
have been fully specialised to base level operations. The
justification for a schema records the logical constraints
between subparts and the preconditions for application of a
schema.

The strategy of schema driven reformulation is as follows:

1. Choose a kernel schema based upon the known
properties of the problem. The preconditions of the
schema must not contradict the known properties of
the problem. For example, one precondition of divide
and conquer is that the input be a partitionable
structure.

2. Propagate the known constraints on the problem and
the constraints on computational complexity through
the schema. Parts of the schema will probably be
uninstantiated, unless sufficient constraints on the
problem have already been discovered.

3. Choose an uninstantiated component of the schema
and instantiate it. Choosing an instantiation for part
of a schema is a recursive call on choosing a schema
for solving a problem. In this case the uninstantiated
component is the new problem. In the process of
instantiation, the kernel schema might be extended.

4. Propagate the constraints from instantiating a
component through the rest of the schema.

5. If the schema is fully instantiated then terminate,
otherwise loop back to step 3.

Constraint based reformulation is essentially goal
directed discovery. The objective is to find properties and
constraints within a problem that match the preconditions
of efficient problem solving methods. It complements steps
one and three of the schema driven strategy. In the example
of constraint based reformulation, all the problem properties
were found before a schema was chosen for instantiation, but
in actual operation the strategy is mixed between schema
driven reformulation and constraint based reformulation.

Constraint based reformulation is based upon the use
of semantic prototypes: abstract structures that capture
exploitable regularities in a domain. Semantic prototypes are
also organised in a dual hierarchy. For example a group is
an extension of a monoid, while a total order is a
specialization of a partial order. Often the constraints that define
semantic prototypes can be found independently. For
example a binary relation is a partial order if it is
antisymmetric and transitive. These constraints are found
independently and cached, for use in possibly instantiating other
semantic prototypes. This substantially reduces the search
space. Semantic prototypes are determined both empirically
and analytically. The empirical search is done by generating
examples and verifying that the examples are consistent with
the constraints. The analytic search is done by setting up the
goal of proving that the constraints of a prototype hold for
some applicable structure such as a binary relation. Special
purpose theorem proving techniques are used, since general
purpose theorem provers are too inefficient.

The strategy of constraint based reformulation is as follows:

1. Characterize the problem description as stated
in terms of semantic prototypes. Although the
prototypes will vary between domains, certain
prototypes are useful in almost any domain, such as
symmetries between arguments to a relation and
transitivity. Reformulate the description in terms of the
applicable semantic prototypes.

2. The semantic prototypes guide the search for
alternative necessary and sufficient conditions.
Specializations of a prototype are often associated
with efficient problem solving methods. The
semantic prototypes are also associated, through extensions,
to other potentially useful prototypes. For example
monotonicity is an extension of a total order.

3. When a useful necessary condition is found, the
subgoal of finding a complementary sufficient condition
is pursued. Similarly, when a useful sufficient
condition is found, the subgoal of finding a complementary
necessary condition is pursued. This usually entails
doing a case analysis on the problem, where the
partitioning of the problem into cases is based upon the
prototypes involved in the necessary or sufficient
condition.

4. When an equivalent condition is found to the original
problem description, all that remains is to choose and
instantiate a problem solving schema.


V. SUMMARY

Problem solving in the domains of vision and robotics
typically requires the discovery and exploitation of problem
properties only implicit in the initial problem specification.
In the future, flexible automatic systems will need to have
some ability to automatically reformulate problems. The
domain of computational geometry was chosen as a simplified
domain for studying methods of reformulation. Two general
methods, schema driven reformulation and constraint based
reformulation, are described in this paper. A system that
uses these two reformulation methods is currently being
implemented for the domain of one and two dimensional
computational geometry.
A Fast Surface Interpolation Technique.

Abstract

A method for interpolating a surface through 3-D data is
presented. The method is computationally efficient and general
enough to allow the construction of surfaces with either smooth
or rough texture.


1.  Introduction 

In  image  analysis  we  are  often  faced  with  the  fact  that  the 
measurements  we  make  in  an  image  only  constrain  properties  of 
the  3-D  world,  instead  of  specifying  them.  Analysis  techniques 
that  recover  3-D  shape  information  from  image  measurements 
incorporate  very  restrictive  assumptions  about  the  nature  of  the 
world.  In  our  attempts  to  avoid  the  need  for  these  restrictions, 
we have been examining hypothesis-and-test methods. If we
assume that we are able to obtain some shape data, from which
we can hypothesize an approximate shape model for the world,
then we can use this model to predict image features. To proceed
from shape data to an approximate shape model we need to "flesh
out" the data. In this paper we address the problem of fitting a
surface  to  a  set  of  points  whose  3-D  locations  are  known.  While 
our  interest  centers  on  fitting  a  surface  to  3-D  location  data  that 
have  been  acquired  by  processing  images  of  that  surface,  the 
technique  developed  has  application  to  a  broad  class  of  surface¬ 
fitting  tasks. 

To  select  a  surface-fitting  procedure,  it  is  insufficient  merely 
to  know  the  data  set  and  to  require  that  a  surface  be  fitted  to  the 
points  in  that  set.  We  also  need  to  know  the  desired  properties  of 
the  surface,  the  characteristics  of  the  data,  and  the  uses  to  which 
the  fitted  surface  will  be  put.  If  we  are  building  a  surface  to 
allow,  say,  water  runoff  estimates  to  be  made,  smoothness  may 
be  a  desired  property  for  that  surface.  For  realistic  rendering 
of  a  natural  surface  in  computer  graphics,  however,  a  fractal 
surface  may  be  preferable.  While  the  technique  we  develop 
can construct either smooth or rough surfaces, our applications
generally  require  the  former.  Our  examples,  Figures  3  and  4, 
show  both  types. 

The research reported herein was supported by the Defense Advanced
Research Projects Agency under Contract MDABO3-U-C-0O27 and by the
National Aeronautics and Space Administration under Contract NASA
0-10604. These contracts are monitored by the U.S. Army Engineer
Topographic Laboratory and by the Texas A&M Research Foundation for
the Lyndon B. Johnson Space Center.

Besides the desired properties of the fitted surface, the
characteristics of the data limit the approach we must adopt
to surface construction. In fitting a surface we must balance the
influence exerted by the data values themselves against that
exerted by the implicit surface model embedded in any fitting
procedure. If our data values are inaccurate and we know the
class of surfaces that should fit the data, we can usually let the
surface model dominate the construction process. Least-squares
methods are typical of procedures that prefer a model to data. In
general, techniques whose resultant surfaces do not conform
exactly to the data are known as approximation methods. Methods
that produce surfaces conforming exactly to the data are called
interpolation methods.
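The distinction between approximation and interpolation can be seen in a small one-dimensional sketch. The sample values, and the choice of a straight line as the "strong model", are assumptions made purely for illustration; they are not taken from the paper.

```python
# Least-squares line (approximation) versus piecewise-linear interpolation.
# The sample values below are made up for illustration.

def least_squares_line(xs, zs):
    """Return (slope, intercept) of the least-squares line through (xs, zs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return slope, mean_z - slope * mean_x

def piecewise_linear(xs, zs, x):
    """Interpolate linearly between neighbouring samples (xs sorted ascending)."""
    for (x0, z0), (x1, z1) in zip(zip(xs, zs), zip(xs[1:], zs[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return z0 + t * (z1 - z0)
    raise ValueError("x lies outside the sampled interval")

xs = [0.0, 1.0, 2.0, 3.0]
zs = [0.0, 1.2, 1.9, 3.1]
slope, intercept = least_squares_line(xs, zs)

# Residuals at the data points: the model-dominated fit misses the data,
# the interpolant reproduces it.
approx_residuals = [abs(slope * x + intercept - z) for x, z in zip(xs, zs)]
interp_residuals = [abs(piecewise_linear(xs, zs, x) - z) for x, z in zip(xs, zs)]
```

The least-squares line leaves nonzero residuals at the samples, whereas the interpolant passes through every sample to within rounding error.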

The selection of an approximation or interpolation method
is influenced by properties of the data other than their accuracy.
Consider, for example, the terrain data collected by a surveyor.
In selecting the places at which to make measurements, he
considers the breakpoints of the surface (that is, those places on
the surface where the gradient is discontinuous), and his data
include measurements at these breakpoints. Surface reconstruction
by linear interpolation over triangular surface patches is
possible because the surveyor has furnished not only the 3-D
data, but also an implicit statement that the surface between
his points can be approximated by planar patches. In matching
stereo pairs of images, an edge-based matcher provides more
than the 3-D data it produces. Like a surveyor's data, it too
makes an implicit statement about the continuity of the imaged
surfaces. On the other hand, an area-based correlation matcher
says less about surface continuity, but has the desirable behavior
of providing regularly spaced data.

Such data can usually be processed with considerably less
computational effort than data that are irregularly spaced. The
volume of data, the regularity of their spacing, the implicit
characteristics of their collection procedure, and their accuracy
are all essential parameters in selecting a surface-fitting
technique. For our applications we choose to investigate
interpolation methods. We want methods that will work with irregularly
spaced data, but still achieve substantial computational savings
if we can use a regular grid of data points. We need to be able
to handle thousands of such points. As a rule, we do not want to
use implicit properties of the data that stem from their collection
procedure.

The uses to which the fitted surface will be put further
restrict the set of applicable surface-fitting procedures. If the
task at hand is surface area estimation, the accuracy of the
surface gradients is not important. Conversely, if we wish to
use the fitted surface to generate its image under some
known lighting conditions, the surface gradient information then
becomes crucial. We can classify the uses of fitted surfaces by
the surface derivatives that are needed. An application that does
not require surface derivatives to be calculated can usually be
satisfied by a surface composed of local patches. That is, the
surface is fitted locally patch by patch, with each patch
determined by a small number of local data points. Such methods
have strong surface models and few data are needed to
instantiate them. As a consequence, however, the surface derivatives
are more a function of the surface model than of the data. The
amount of data used to determine the surface patch may be
barely sufficient to calculate an average value for the surface
derivatives across the whole patch; besides, any variations in
derivatives across the patch are caused by the model, not the
data. The more data employed, the less is the influence of the
surface model on the calculation of surface derivatives. In the
extreme case, all the data may be used to determine the
surface to be fitted at each locality. Such techniques are called
global methods, whereas those that use only local data are
local methods. Our applications require that we calculate surface
curvature from our fitted surface. The technique we present here
is a global method for surface fitting.

In summary, we address the problem of fitting a surface to a
large data set composed mostly of regularly spaced data points,
but which also includes grid points at which we have no data,
and nongrid points where data values are known. The data are
acquired through a collection process that is assumed to yield
accurate values, but for which we choose not to characterize the
data further. We require a solution that is smooth and from
which we can calculate the first and second surface derivatives.
We present details of a global interpolation method that is
computationally efficient and appears to be applicable to a broad range
of tasks. Although the general form of the method applies to
nongridded data, our computationally efficient algorithm comes
from exploiting the regularity of the data points.

We commence by considering the multiquadric method
invented by Hardy [1] for modeling natural terrain. In its
generalized form, we examine it under the restriction of regularly spaced
data points and derive an algorithm to solve for the unknown
parameters. We show how to generate the interpolated surface
in an efficient manner.


2.  Surface  Interpolation 

2.1  Hyperbolic  Multiquadrics 

Suppose we have a set of data points, $\{(x_i, y_i, z_i)\}_{i=0}^{n-1}$, in 3-D
space to which we wish to fit a hyperbolic multiquadric surface
[1] defined by

$$z(x, y) = \sum_{j=0}^{n-1} c_j \, [d_j(x, y) + h]^{1/2}$$

where $d_j(x, y) = (x - x_j)^2 + (y - y_j)^2$, h is a user-specified
constant, and the $c_j$'s are the coefficients that must be determined.

To understand this method, let us suppose that h = 0. The
data are fitted by placing a cone at each of the n data points
so that the cone's axis is aligned with the z axis direction and
the cone's apex is in the z = 0 plane. That is, the data are
fitted with a set of cones, some of which are inverted. The z
value of the constructed surface at position (x, y) is calculated
by summing the z values contributed by each of the n cones at
this (x, y) position.

Each cone has one free parameter, namely, its apex angle;
we determine these apex angles by requiring that the constructed
surface pass exactly through the data points. In the foregoing
expression, the $c_j$'s correspond to the apex angles of the cones.
We calculate the $c_j$'s by solving the n×n system of linear
equations

$$\sum_{j=0}^{n-1} c_j \, [d_j(x_i, y_i) + h]^{1/2} = z_i, \qquad i = 0, \ldots, n-1$$
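A minimal sketch of this fitting step for a handful of scattered points follows. The sample points, the value of h, and the use of plain Gaussian elimination with partial pivoting are illustrative assumptions; the paper does not state which solver was used for the multiquadric system.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def basis(x, y, xj, yj, h):
    """Hyperbolic multiquadric basis [d_j(x, y) + h]^(1/2)."""
    return math.sqrt((x - xj) ** 2 + (y - yj) ** 2 + h)

def fit_multiquadric(points, h):
    """Return the coefficients c_j that make the surface pass through the points."""
    a = [[basis(xi, yi, xj, yj, h) for (xj, yj, _) in points]
         for (xi, yi, _) in points]
    return solve_linear_system(a, [z for (_, _, z) in points])

def evaluate(points, coeffs, h, x, y):
    """Sum the contribution of every basis surface at (x, y)."""
    return sum(c * basis(x, y, xj, yj, h)
               for c, (xj, yj, _) in zip(coeffs, points))

# Hypothetical scattered samples (x, y, z); spacing need not be regular.
points = [(0.0, 0.0, 1.0), (1.0, 0.2, 2.0), (0.3, 1.1, 0.5),
          (1.4, 1.3, 1.5), (0.7, 0.6, 3.0)]
h = 0.25
coeffs = fit_multiquadric(points, h)
```

Evaluating the fitted surface at any of the sample locations returns the sample height, which is the defining property of an interpolation method.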

Note that this fitting technique does not require that the
data be regularly spaced; furthermore, when h ≠ 0,
hyperboloids rather than cones are fitted to the data. Cones and
hyperboloids are not the only options. Stead [2], for example,
has generalized this method, using the form

*(*.*)  dj(x,y)  +  h]* 

2.2  General  Form 

We examine surface-fitting techniques that use the general
form of the above method, namely,

$$z(x, y) = \sum_{j=0}^{n-1} c_j \, g(x - x_j, y - y_j)$$

where the kernel function g is any function of the parameters
$x - x_j$, $y - y_j$. Clearly, the previously defined functions are
particular cases of this form. As before, we determine the $c_j$'s
by solving the n×n system of linear equations

$$\sum_{j=0}^{n-1} c_j \, g(x_i - x_j, y_i - y_j) = z_i, \qquad i = 0, \ldots, n-1$$

For large values of n it is not feasible to solve this system
of equations. In our applications n may be 10,000. However,
for smaller n we have used the above form to "patch" holes in
a regular grid of data points. While any kernel function can be
employed, we have found it important to match the method used
to solve the n×n system of linear equations to the form of the
kernel function selected. The numerical difficulties encountered
in solving some of the systems of equations produced by a
particular kernel can often be averted by exploiting properties of
the linear system stemming from the choice of kernel function.
For example, if we use the Gaussian function as the kernel, the
symmetric positive definite coefficient matrix of the system of
linear equations allows solution by the "square-root" method
(see, for example, [3]), and avoids the numerical problems created
by Gaussian elimination. If we impose the restriction that the
data points must be gridded, we can find feasible solution
techniques even when n is of the order of millions.
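The "square-root" method is more widely known today as Cholesky factorization. The sketch below applies it to a small Gaussian-kernel system; the sample points and the kernel width sigma are hypothetical values chosen only so that the example runs.

```python
import math

def gaussian_kernel(dx, dy, sigma):
    """Gaussian kernel g(dx, dy) with an assumed width parameter sigma."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def cholesky_solve(a, b):
    """Solve a x = b using the factorization a = L L^T."""
    L = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):           # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Hypothetical scattered samples (x, y, z).
points = [(0.0, 0.0, 1.0), (1.0, 0.2, 2.0), (0.3, 1.1, 0.5),
          (1.4, 1.3, 1.5), (0.7, 0.6, 3.0)]
sigma = 0.5
a = [[gaussian_kernel(xi - xj, yi - yj, sigma) for (xj, yj, _) in points]
     for (xi, yi, _) in points]
zs = [z for (_, _, z) in points]
coeffs = cholesky_solve(a, zs)
```

The factorization needs no pivoting because the Gaussian kernel matrix for distinct points is symmetric positive definite, which is the property the text relies on.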

2.3  Regular  Grid  Solution 

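Consider the problem of fitting the surface

$$z(x, y) = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} c_{i,j} \, g(x - x_{i,j}, y - y_{i,j}) \qquad (1)$$

where $\{(x_{i,j}, y_{i,j})\}_{i=0,\,j=0}^{n-1,\,m-1}$ is an n×m regular grid, to the data set
$\{(x_{i,j}, y_{i,j}, z_{i,j})\}_{i=0,\,j=0}^{n-1,\,m-1}$. We can find an expression for
calculating the $c_{i,j}$'s in the following manner.

Let G(u, v) denote the discrete Fourier transform (DFT) of
g(x, y). Using the shift theorem of DFT theory, we note that the
DFT of $g(x - x_{i,j}, y - y_{i,j})$ is
$G(u, v)\, e^{-\mathrm{i} 2\pi (u x_{i,j}/n + v y_{i,j}/m)}$, where $\mathrm{i} = \sqrt{-1}$. If Z(u, v)
denotes the DFT of z(x, y), we can write the DFT of Equation
(1), namely,

$$Z(u_{k,l}, v_{k,l}) = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} c_{i,j} \, G(u_{k,l}, v_{k,l}) \, e^{-\mathrm{i} 2\pi (u_{k,l} x_{i,j}/n + v_{k,l} y_{i,j}/m)}$$

Removing G from the summation, we have

$$\frac{Z(u_{k,l}, v_{k,l})}{G(u_{k,l}, v_{k,l})} = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} c_{i,j} \, e^{-\mathrm{i} 2\pi (u_{k,l} x_{i,j}/n + v_{k,l} y_{i,j}/m)}$$

Taking the inverse DFT of the above expression, we obtain

$$\sum_{k=0}^{n-1} \sum_{l=0}^{m-1} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} c_{i,j} \, e^{-\mathrm{i} 2\pi [u_{k,l}(x_{i,j} - x_{p,q})/n + v_{k,l}(y_{i,j} - y_{p,q})/m]} = \sum_{k=0}^{n-1} \sum_{l=0}^{m-1} \frac{Z(u_{k,l}, v_{k,l})}{G(u_{k,l}, v_{k,l})} \, e^{\mathrm{i} 2\pi (u_{k,l} x_{p,q}/n + v_{k,l} y_{p,q}/m)}$$

Using $F^{-1}$ to represent the inverse DFT, $F^{-1}[Z/G](x_{p,q}, y_{p,q})$
is the inverse DFT of Z/G, calculated at $(x_{p,q}, y_{p,q})$. We have

$$\sum_{k=0}^{n-1} \sum_{l=0}^{m-1} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} c_{i,j} \, e^{-\mathrm{i} 2\pi [u_{k,l}(x_{i,j} - x_{p,q})/n + v_{k,l}(y_{i,j} - y_{p,q})/m]} = F^{-1}\!\left[\frac{Z}{G}\right](x_{p,q}, y_{p,q})$$

Now

$$\sum_{k=0}^{n-1} \sum_{l=0}^{m-1} e^{-\mathrm{i} 2\pi [u_{k,l}(x_{i,j} - x_{p,q})/n + v_{k,l}(y_{i,j} - y_{p,q})/m]} = \begin{cases} nm & \text{if } x_{i,j} = x_{p,q} \text{ and } y_{i,j} = y_{p,q}, \\ 0 & \text{otherwise.} \end{cases}$$

Hence

$$c_{p,q} = \frac{1}{nm} \, F^{-1}\!\left[\frac{Z(u, v)}{G(u, v)}\right](x_{p,q}, y_{p,q}) \qquad (2)$$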


An alternate way of viewing the above derivation is to note
that $g(x_{p,q} - x_{i,j}, y_{p,q} - y_{i,j})$ forms a circulant matrix, and to
recall that such matrices are diagonalized by the discrete Fourier
transform [4].
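The regular-grid solution can be checked numerically on a small grid. The sketch below uses a naive DFT written out in full so that it is self-contained; a practical implementation would use an FFT routine. It assumes a forward transform without scaling and an inverse transform that divides by nm, so the 1/nm factor sits inside the inverse. The Gaussian kernel, its width, the wrap-around (periodic) treatment of the grid, and the data values are all assumptions for illustration.

```python
import cmath
import math

def dft2(a, inverse=False):
    """Naive 2-D DFT of an n x m array (list of rows); for small grids only.
    The forward transform is unnormalized; the inverse divides by n*m."""
    n, m = len(a), len(a[0])
    sign = 1.0 if inverse else -1.0
    out = [[0j] * m for _ in range(n)]
    for k in range(n):
        for l in range(m):
            total = 0j
            for i in range(n):
                for j in range(m):
                    phase = sign * 2.0 * math.pi * (k * i / n + l * j / m)
                    total += a[i][j] * cmath.exp(1j * phase)
            out[k][l] = total / (n * m) if inverse else total
    return out

def periodic_gaussian(n, m, sigma):
    """Gaussian kernel sampled at grid offsets, wrapped so the grid is periodic."""
    def wrap(d, size):
        return min(d, size - d)
    return [[math.exp(-(wrap(i, n) ** 2 + wrap(j, m) ** 2) / (2.0 * sigma ** 2))
             for j in range(m)] for i in range(n)]

def solve_coefficients(z, g):
    """Coefficients c = inverse DFT of Z / G, one per grid point."""
    Z, G = dft2(z), dft2(g)
    ratio = [[Z[k][l] / G[k][l] for l in range(len(z[0]))]
             for k in range(len(z))]
    return [[v.real for v in row] for row in dft2(ratio, inverse=True)]

n, m = 4, 5
z = [[math.sin(i) + 0.5 * j for j in range(m)] for i in range(n)]   # made-up data
g = periodic_gaussian(n, m, sigma=0.8)
c = solve_coefficients(z, g)

# Check: summing the kernels placed at every grid point reproduces the data.
recon = [[sum(c[i][j] * g[(p - i) % n][(q - j) % m]
              for i in range(n) for j in range(m))
          for q in range(m)] for p in range(n)]
```

The division by G requires that no DFT coefficient of the kernel vanish, which holds for this kernel and grid size but is not guaranteed for every kernel.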


2.4  Surface  Rendering 

Once the $c_{i,j}$'s have been calculated, Equation (1) provides
an analytic expression for the constructed surface. We can
calculate z(x, y) for any (x, y) position. However, each such
calculation involves the sum of nm terms. If it is our intention to use
this analytic expression for surface interpolation we may have
to calculate this sum a very large number of times. The cost
savings gained in computing the $c_{i,j}$ coefficients by means of the
DFT (implemented by the fast Fourier transform) will be offset
by the cost of these summations. As a rule, if we commence
with data on a regular grid we want an interpolated surface on a
finer grid. In this case considerable savings are realized
when the DFT is again employed.

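Suppose we want to interpolate each grid interval in x and
y at an additional number of points so that the final surface is
calculated on an rn×sm grid. Consider Equation (1), revised for
the new, larger grid:

$$z(x, y) = \sum_{i=0}^{rn-1} \sum_{j=0}^{sm-1} c'_{i,j} \, g(x - x_{i,j}, y - y_{i,j}) \qquad (3)$$

where

$$c'_{i,j} = \begin{cases} c_{i/r,\, j/s} & \text{if } i \ (\mathrm{mod}\ r) = 0 \text{ and } j \ (\mathrm{mod}\ s) = 0, \\ 0 & \text{otherwise.} \end{cases}$$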

That is, we assume the surface is constructed by the placing
of objects at each of the new grid points, but zero coefficients
are associated with all objects except those placed at the original
data points. Now, taking the DFT of Equation (3), we get

$$Z(u_{k,l}, v_{k,l}) = \sum_{i=0}^{rn-1} \sum_{j=0}^{sm-1} c'_{i,j} \, G'(u_{k,l}, v_{k,l}) \, e^{-\mathrm{i} 2\pi (u_{k,l} x_{i,j}/rn + v_{k,l} y_{i,j}/sm)}$$

for k = 0, ..., rn - 1

l = 0, ..., sm - 1,

i.e.,

$$Z(u_{k,l}, v_{k,l}) = G'(u_{k,l}, v_{k,l}) \, C'(u_{k,l}, v_{k,l}),$$

where G'(u, v) is the DFT of g(x, y), defined on the finer grid,
and C'(u, v) is the DFT of the array $c'_{i,j}$.

The interpolated surface can be formed by taking the
inverse DFT of the above expression:

$$z(x_{i,j}, y_{i,j}) = F^{-1}[G'(u_{k,l}, v_{k,l}) \, C'(u_{k,l}, v_{k,l})]$$

Note that, in finding the $c_{i,j}$'s in Equation (2), we took the
inverse DFT of Z/G, then stretched these $c_{i,j}$'s by adding zeros
at the points corresponding to the new interpolation points, and
finally took the DFT of the stretched coefficients to calculate
C'(u, v). These steps are in fact unnecessary, for we can calculate
the C'(u, v)'s directly from the Z/G's. The similarity theorem
of DFT theory [5] is required:

$$C'(u_{k,l}, v_{k,l}) = \frac{Z(u_{k \ (\mathrm{mod}\ n),\, l \ (\mathrm{mod}\ m)}, v_{k \ (\mathrm{mod}\ n),\, l \ (\mathrm{mod}\ m)})}{G(u_{k \ (\mathrm{mod}\ n),\, l \ (\mathrm{mod}\ m)}, v_{k \ (\mathrm{mod}\ n),\, l \ (\mathrm{mod}\ m)})} \qquad (4)$$

k = 0, ..., rn - 1
l = 0, ..., sm - 1.
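The effect of stretching an array with zeros can be verified in one dimension. The sketch below assumes an unnormalized forward DFT, under which the transform of the stretched array is an exact periodic replication of the original transform; under a forward transform scaled by the array length the replicated values would pick up a constant factor. The coefficient values and the factor r are arbitrary.

```python
import cmath
import math

def dft(a):
    """Naive unnormalized 1-D DFT."""
    n = len(a)
    return [sum(a[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]

def stretch_with_zeros(c, r):
    """Place c[i] at index r*i of a length r*len(c) array; other entries are zero."""
    out = [0.0] * (r * len(c))
    for i, value in enumerate(c):
        out[r * i] = value
    return out

c = [1.0, -2.0, 0.5, 3.0]      # arbitrary coefficients on the coarse grid
r = 3                          # refinement factor
C = dft(c)

# Route 1: stretch in the spatial domain, then transform.
C_stretched = dft(stretch_with_zeros(c, r))

# Route 2: replicate the coarse-grid transform, indexing with k mod n.
C_replicated = [C[k % len(c)] for k in range(r * len(c))]

max_diff = max(abs(a - b) for a, b in zip(C_stretched, C_replicated))
```

Both routes give the same array, so the inverse transform, the zero insertion, and the second forward transform can be skipped.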


2.5  Algorithm 

We can now write down our interpolation algorithm:

1. Given the data array $z(x_{i,j}, y_{i,j})$, find its DFT, $Z(u_{i,j}, v_{i,j})$.

2. Find the DFT, $G(u_{i,j}, v_{i,j})$, of the kernel function, $g(x_{i,j}, y_{i,j})$.

3. $C(u_{i,j}, v_{i,j}) = Z(u_{i,j}, v_{i,j}) / G(u_{i,j}, v_{i,j})$.

4. Calculate $C'(u_{i,j}, v_{i,j})$, using Equation (4).

5. Calculate $G'(u_{i,j}, v_{i,j})$ for the larger interpolation grid.

6. $Z'(u_{i,j}, v_{i,j}) = C'(u_{i,j}, v_{i,j}) \, G'(u_{i,j}, v_{i,j})$.

7. Find the interpolated surface by taking the inverse DFT of
$Z'(u_{i,j}, v_{i,j})$.

Note  that  for  selected  kernel  functions  Steps  2  and  5  could 
be  precomputed  for  standard-size  grids.  As  an  alternative,  these 
steps  can  sometimes  be  accomplished  by  analytic  means  if  the 
analytic  form  of  the  kernel  function  is  known. 
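The seven steps can be traced in a one-dimensional analogue. This is a sketch under several assumptions that are not in the paper: the data are treated as periodic, the kernel is a wrapped Gaussian of arbitrary width, a naive DFT stands in for the FFT, and the forward transform is unnormalized with the inverse dividing by the array length.

```python
import cmath
import math

def dft(a, inverse=False):
    """Naive 1-D DFT; forward unnormalized, inverse divides by the length."""
    n = len(a)
    sign = 1.0 if inverse else -1.0
    out = [sum(a[i] * cmath.exp(sign * 2j * math.pi * k * i / n) for i in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def kernel(x, period, sigma=0.8):
    """Periodic Gaussian kernel; x and period are in coarse-grid units."""
    d = x % period
    d = min(d, period - d)
    return math.exp(-d * d / (2.0 * sigma * sigma))

def interpolate(z, r):
    """Interpolate n periodic samples onto a grid r times finer."""
    n = len(z)
    Z = dft(z)                                              # Step 1
    G = dft([kernel(i, n) for i in range(n)])               # Step 2
    C = [Z[k] / G[k] for k in range(n)]                     # Step 3
    C_fine = [C[k % n] for k in range(r * n)]               # Step 4
    G_fine = dft([kernel(t / r, n) for t in range(r * n)])  # Step 5
    Z_fine = [C_fine[k] * G_fine[k] for k in range(r * n)]  # Step 6
    return [v.real for v in dft(Z_fine, inverse=True)]      # Step 7

z = [0.0, 1.0, 0.5, 2.0, 1.5, 0.2, -0.3, 0.8]   # made-up samples
r = 4
surface = interpolate(z, r)
```

Every r-th value of `surface` coincides with an original sample, and the values in between are the kernel sum evaluated at the new grid points.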


3.  Discussion 

For purposes of illustration, let us compare the
computational efficiency of this method on a regular grid with the cost
of the usual nongridded formulation of the multiquadric. Of
course, since the usual formulation deals with irregularly spaced
data, we would not expect it to compare favorably with this
method; such a comparison nevertheless confirms the advantages
of our technique. Consider a square n×n grid of data points on
which we want an interpolated surface over an rn×rn grid. The
usual multiquadric formulation solves an $n^2 \times n^2$ system of linear
equations at a cost proportional to $n^6$, and calculates rn×rn sums
of $n^2$ terms at a cost proportional to $r^2 n^4$. If it is assumed that
n > r, this cost is dominated by the $n^6$ term.

The algorithm outlined above is dominated by the cost of
the DFTs in Steps 5 and 7. We use the fast Fourier transform to
implement the DFT. This means that we pad our data with zeros
to force the dimension size of the grid to be a power of 2. At
worst, our grid is 2rn×2rn. The cost of the DFT is proportional
to $4r^2 n^2 \log 2rn$. Even if r were as great as n, this cost would be
proportional only to $n^4 \log n$. From an empirical standpoint, the
algorithm outlined is faster for n (and k) of the order of 10.
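The two growth rates can be compared directly. The functions below only evaluate the proportionalities quoted in the text; the constants of proportionality are ignored and the logarithm is taken to base 2, both of which are assumptions, so the ratios indicate trends and are not timing predictions.

```python
import math

def dense_cost(n, r):
    """n^6 for the n^2 x n^2 solve plus r^2 n^4 for evaluating the sums."""
    return n ** 6 + (r ** 2) * (n ** 4)

def fft_cost(n, r):
    """4 r^2 n^2 log2(2 r n): an FFT over the worst-case padded 2rn x 2rn grid."""
    return 4 * (r ** 2) * (n ** 2) * math.log2(2 * r * n)

# Ratio of the two cost expressions for a few grid sizes at r = 4.
ratios = {n: dense_cost(n, r=4) / fft_cost(n, r=4) for n in (10, 32, 100)}
```

The ratio grows rapidly with n, which is consistent with the observation that the DFT-based algorithm already wins for grids of modest size.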

The outlined algorithm places little limitation on the type
of kernel function employed. Not only smooth, but also rough
functions may comprise the basic objects from which the
surface is built. We have used, inter alia, cones, hyperboloids,
and Gaussian-shaped objects, some of which had fractal texture
added to them. In Figures 1-4 we show profile plots. Figure 1
shows a real surface, Figure 2 the sampling grid we used to select
data points. In Figure 2 the profiles depict what would have
been obtained if we had used bilinear interpolation to build the
surface. Figure 3 reveals the resultant surface when Gaussian
kernel functions were used, while Figure 4 was obtained with
a kernel function that had fractal texture added to a Gaussian
base. When we compare the fitted surface to ground truth, the
average error for the smooth kernel functions used by us is
approximately one percent of the data height range. As with any
fitting technique, we cannot construct surface features that are
not described by the sampled data.

We indicated that one reason for investigating global
surface interpolation techniques was the need to calculate reliable
estimates of surface curvature. In our preliminary trials with
synthetic surface data the constructed surface appears to allow
adequate surface curvature estimation. This will be tested
further in future applications.

4.  Conclusions 

We have presented a method of surface interpolation that
is computationally efficient. The reconstructed surface is fitted
globally to enable the data rather than an implicit surface model
to control the construction process. The method makes it
possible to build not only the more customary smooth interpolated
surface, but the roughly textured type as well.

Surface reconstruction methods provide a means of
using the hypothesis-and-test approach in image analysis. They
provide a mechanism for using image information that only
constrains rather than specifies 3-D world parameters. The outlined
algorithm is a tool for hypothesizing a broad range of surface
types.

Figure 1. Real Surface Used as Ground Truth.

Figure 2. Data Sampled at Grid Intersections.

Figure 3. Reconstructed Surface by Means of Gaussian
Kernel Functions.

Figure 4. Reconstructed Surface by Means of Fractal
Textured Gaussian Kernel Functions.
Abstract

Recent research at Columbia University has
demonstrated that fine-grained tree-structured SIMD
architectures, which have favorable characteristics for
efficient VLSI implementation, can be used for the
rapid execution of a wide range of image
understanding tasks. The NON-VON supercomputer,
currently being constructed at Columbia University, is
an example of such an architecture. In this paper, we
describe two algorithms for the implementation of the
Hough transform method on NON-VON. The first
one may be regarded as a parallel implementation of
the standard sequential machine algorithm. The
second algorithm treats the NON-VON tree as an
independent set of subtrees, resulting in a more
efficient algorithm in terms of both execution time
and the number of processing elements required. A
parallel high level language based on PASCAL is used
to describe our algorithms. A description of certain
special architectural features of the NON-VON
machine that have proven useful for image
understanding algorithms is also presented.


1  Introduction 


An important goal for researchers in computer vision
is to construct vision systems that receive an image or
a sequence of images from a sensory device and
output an interpretation of this input in real time.
Input images with reasonable resolution contain large
quantities of data, and conventional von Neumann
machines require an excessive amount of time to
sequentially process the fetched data. Image
understanding applications, however, usually involve
computations that can be performed simultaneously on
many or all of the image elements. Consequently,
parallel computers are highly desirable for fast
execution of image understanding tasks.

A computer implementation of a complete vision
system requires not only the performance of many
computations on large structured arrays of raw image
data at the lowest level, but also sophisticated
decision making at the highest level. Recent advances
in very large scale integrated (VLSI) circuitry have
led to a surge in research aimed at developing new
computer organizations that meet the large
computational requirements of image analysis tasks.
Various kinds of special parallel machines for
computer vision have been proposed and some of
them have been implemented; examples are described
in [Duff 76], [Mark 80], [Pott 83], and [Kush 82].

Recent research at Columbia University has
demonstrated that fine-grained tree-structured single
instruction stream, multiple data stream (SIMD)
[Flyn 72] computer architectures can be used for the
rapid execution of a wide range of vision system
tasks. The term "fine-grained machines" refers to
machines with a very large number of very small
processing elements (PE's). A detailed discussion of
the efficient VLSI implementation of tree-organized
machines can be found in [Ibra 83]. The NON-VON
supercomputer, currently being constructed at
Columbia University, is a representative example of
this class of architectures. Several parallel image
understanding algorithms have been developed and
implemented on a simulator of the NON-VON
machine. The tasks have been selected to span
different levels of computer vision. These algorithms
incorporate novel approaches to reduce the effects of
the communication bottleneck usually associated with
tree architectures. Issues related to the
implementation of these algorithms, such as image
representation [Ibra 84b] ...

... detecting object boundaries described by parameterized
curves. (Other tree-based parallel image
understanding algorithms have also been developed
for the NON-VON machine. They include the
gray-scale image histogram, image correlation, connected
component labeling, computing Euler's number for
images, set operations, and the computation of the
geometric properties of objects.)

In the following section, the architecture of NON-VON
is described and compared with other proposed
hierarchical architectures for vision. In Section Three,
algorithms for the Hough transform method are
described.

2  The NON-VON Supercomputer Architecture

The name NON-VON is used to describe a family of
massively parallel tree-structured machines intended to
support large scale data manipulation [Shaw 84a].
The architectures of all the NON-VON family
members include a tree-structured Primary Processing
Subsystem (PPS) based on custom VLSI circuits, along
with a Secondary Processing Subsystem (SPS) based
on a bank of intelligent disk drives. Figure 1 depicts
the top-level organization of the NON-VON
architecture.

The PPS is configured as a binary tree of small
processing elements (SPE's). Each SPE comprises a
small RAM (64 bytes in the prototype now under
development), a modest amount of processing logic,
and an I/O switch. The I/O switch can be set for
different modes of communication, as will be described
in the next subsection. The SPS is based on a
number of rotating storage devices. Associated with
each disk head in the SPS is a separate sense
amplifier and a small amount of logic capable of
dynamically examining the data passing beneath it
[Shaw 82]. This organization supports parallel
transfer of data between the PPS and SPS, which is
necessary to keep I/O from becoming a bottleneck.

NON-VON 1 and NON-VON 3, the first two
members of the NON-VON family, include a single
special control processor (CP) at the root of the tree.
The CP is responsible for coordinating different
activities within the PPS. It is capable of
broadcasting instructions to be executed in all active
PE's simultaneously. Thus, NON-VON 1 and
NON-VON 3 function for the most part as SIMD
machines, with all SPE's simultaneously executing the
same instruction. Algorithms that use this mode of
execution are referred to as SIMD algorithms.
NON-VON 3 executes about four million instructions
per second [Shaw 84b]. This number is used
throughout the paper to compute the time required to
execute the developed algorithms.

The first member of the NON-VON family,
NON-VON 1, contains chips with only one PE, and is
being constructed primarily to evaluate certain
electrical, timing, and layout area characteristics. The
chip has already been tested and has been proven
functional. A modified version of the chip with eight
PE's has been designed for use in NON-VON 3 (a
detailed description can be found in [Shaw 84b]). The
modified chip has less area per PE, and the
instruction set is made more powerful by generalizing
register-to-register data transfers and adding more
arithmetic processing power. Algorithms described in
this paper are based on the NON-VON 3 architecture
and instruction set.

In the following subsection, we describe the
communication modes in the NON-VON tree, and
present a brief description of a PASCAL-based parallel
language that is used in describing our algorithms.

2.1  Communication Modes in the NON-VON Tree

Inter-PE communication in NON-VON is supported by
the I/O switch, which is a matrix of pass transistors
that routes data between the two internal buses and
the I/O ports. The NON-VON I/O switch supports
three modes of communication:

1  Global bus communication,
supporting both broadcast by the
CP to all PE's in the PPS, as
required for SIMD execution, and
data transfers from a single selected
PE to the CP. No concurrency is
achieved when data is transferred
from one PE to another through the
CP using the global communication
instructions. An instruction called
RESOLVE can be used to disable
all but a single PE chosen from
among a specified set of PE's. This
is an example of a hardware
multiple match resolution scheme, in
the terminology of the literature of
associative processors. The CP, on
executing a RESOLVE instruction, is
able to determine whether the
operation resulted in any PE being
enabled or not. The REPORT
instruction transfers data from the
single chosen PE to the CP using
global bus communication.

2  Tree communication, supporting data
transfers among PE's that are
physically adjacent within the PPS
tree. Instructions support data
transfers to the Parent (P), Left
Child (LC), and Right Child (RC)
PE's. Full concurrency is achieved in
this mode, since all nodes can
communicate with their physical tree
neighbors in parallel.

3  Linear communication, in which the
whole tree is reconfigured as a
linear array of PE's. This mode of
communication supports data
transfers to the Left Neighbor (LN)
or Right Neighbor (RN) PE's in the
linear array. Linear communication
is useful for applications that require
a predefined total ordering of data.
Figure 1 shows how the linear
logical sequence has been mapped
onto the tree structured physical
topology of the PPS by inorder
enumeration [Knut 73].
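The inorder mapping used by the linear communication mode can be sketched in a few lines. The heap-style node numbering (root 1, children 2k and 2k+1) and the function names are assumptions introduced for illustration; the paper does not say how NON-VON numbers its PE's.

```python
def inorder_positions(depth):
    """List the heap-numbered nodes of a complete binary tree with `depth`
    levels in inorder; position in the list is position in the linear array."""
    size = 2 ** depth - 1
    order = []

    def visit(node):
        if node > size:
            return
        visit(2 * node)        # left subtree first
        order.append(node)     # then the node itself
        visit(2 * node + 1)    # then the right subtree

    visit(1)
    return order

def linear_neighbors(order, node):
    """Return (left neighbor, right neighbor) of `node` in the linear array."""
    pos = order.index(node)
    left = order[pos - 1] if pos > 0 else None
    right = order[pos + 1] if pos + 1 < len(order) else None
    return left, right

order = inorder_positions(3)              # a 7-node tree
root_neighbors = linear_neighbors(order, 1)
```

For the 7-node tree the linear order is 4, 2, 5, 1, 6, 3, 7, so the linear neighbors of the root are the rightmost leaf of its left subtree and the leftmost leaf of its right subtree, neither of which is a physical tree neighbor of the root.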

The original NON-VON architecture, which was not
intended for computer vision applications, differs from
other proposed highly parallel hierarchical image
understanding architectures (for example, [Tam 83]) in
that it does not employ any extra physical links to
perform mesh neighbor communication. This is
advantageous from a hardware point of view, as it
results in a fixed number of pins per integrated
circuit chip, independent of the number of PE's on
that chip. This makes it possible to increase the size
of the tree as chip dimensions scale down by simply
embedding more PE's on the chip. Increasing the
machine size involves only removing the old PE chips
and plugging in the new ones. On the other hand,
the lack of mesh connections slows many local
operations in which the output value at an image
point depends on its own image value and that of
neighbor points. Plans for a vision-oriented variant of
the NON-VON 3 that includes mesh connections are
now being formulated. These hardware modifications,
along with alternative algorithms, will be discussed in
[Ibra 84c].

NON-VON's other special hardware features, including its ability to be
configured logically as a linear array, its fast global instruction
broadcast, and its multiple match resolution scheme, have proven
useful in the algorithms we have developed to overcome some of the
problems related to the communication bottleneck generally associated
with trees.
In the following section, a PASCAL-based parallel language, referred
to as NV-PASCAL, is used to describe the NON-VON Hough transform
algorithms presented in this paper. NV-PASCAL contains some features
that simplify and clarify the presentation of our NON-VON algorithms.
In the remaining part of this section, we briefly describe some
features of NV-PASCAL that are relevant to the algorithms described in
this paper.

The language has been designed to be used on SIMD architectures. The
principal idea behind the design of NV-PASCAL has been to create
features that make full use of the machine's parallel capabilities
while retaining all of the high level constructs of PASCAL [Baco 82].
For that purpose, one new data type and five extra constructs have
been added to standard PASCAL. Built-in functions that use the NON-VON
instruction set allow the programmer to explicitly make use of the
NON-VON tree structure. The NV-PASCAL compiler, developed for the
NON-VON supercomputer, produces LISP code that is executed on the
existing LISP NON-VON Primary Processing Subsystem (PPS) simulator. On
the real NON-VON machine, the LISP code generated will actually cause
broadcast of the NON-VON machine instructions to the PPS.

Variables whose values are stored in the tree PE's are referred to as
vector variables, while scalar variables are those stored in the CP.
(Vector variables thus represent a vector of values, stored once per
PE, while scalar variables are not replicated.) Italics are used to
refer to scalar variables, and capital letters to refer to vector
variables.

There are two types of statements in NV-PASCAL: sequential and
parallel. The sequential statements are those of standard PASCAL,
while the parallel statements are those that operate on vector
variables. We describe here only the two parallel statements that are
used in this paper. The assignment statement can be either sequential
or parallel. The sequential statement is the assignment statement
encountered in standard PASCAL. The parallel assignment statement is
one that refers to a variable that is defined as a vector variable.
The parallel assignment statement is executed concurrently in all
active PE's in the machine.
The WHERE statement is the conditional statement that operates only on
vector variables. The form of the WHERE statement is as follows:

WHERE <conditional expression>
DO <statement>
[ ELSEWHERE <statement> ] ;

It is used to first select only those PE's with vector variables that
satisfy the boolean expression. The statement following the DO is then
executed only in these PE's. The statement following the optional
ELSEWHERE is executed in the subset of the PE's that failed to satisfy
the original conditional expression.

An important point to remember is that the WHERE statement generally
executes both the statement following the DO and the statement
following the ELSEWHERE (unless either all or none of the PE's satisfy
the condition).
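The selection behavior of WHERE can be mimicked on a sequential machine with a per-PE boolean mask. The following is only a sketch of the semantics described above: the helper `where`, the list-per-vector-variable representation, and the sample data are assumptions made for illustration, and the two loops stand in for what the machine does in all selected PE's at once.

```python
def where(condition, do, elsewhere=None):
    """Apply `do` in PEs whose condition holds and `elsewhere` in the rest.

    `condition` is a list of booleans, one per PE, evaluated once up front;
    `do` and `elsewhere` take a PE index and update that PE's local state.
    """
    for pe, selected in enumerate(condition):      # DO branch
        if selected:
            do(pe)
    if elsewhere is not None:                      # ELSEWHERE branch
        for pe, selected in enumerate(condition):
            if not selected:
                elsewhere(pe)


HT = ['Y', 'N', 'Y', 'N']      # one flag per PE (hypothetical data)
A1 = [None] * len(HT)          # one-bit register, one per PE


def set_one(pe):
    A1[pe] = 1


def set_zero(pe):
    A1[pe] = 0


# where HT = 'Y' do A1 = 1 elsewhere A1 = 0
where([flag == 'Y' for flag in HT], set_one, set_zero)
```

Both branches are visited in turn, which mirrors the remark that a WHERE statement generally costs the time of both its DO and ELSEWHERE parts.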

Built-in functions based on the NON-VON instruction set are used to
implement operations that use the tree communication modes of the
machine. Function names that start with 'N-' correspond to NON-VON
machine instructions, and their parameters correspond to the arguments
of these instructions.

3  The  Hough  Transform  on  NON-VON

The Hough transform method is used frequently in image understanding
tasks to detect the shape of object boundaries described by parametric
curves. This method is based on the duality between points on a curve
and the parameters of that curve. In his initial work, Hough [Houg 62]
described a method for detecting straight lines in an image using the
slope-intercept parameterization of the line. According to this
parameterization, the line equation is expressed as

$y = mx + c$

Suppose that we have a set of image points
$\{(x_1, y_1), \ldots, (x_n, y_n)\}$ that have a likelihood of being
on linear boundaries. In this paper, we refer to these points as
boundary points. The Hough transform method organizes the boundary
points into a set of straight lines as follows. Consider a boundary
point $(x_i, y_i)$ in the image plane. The parameters of all lines
passing through this point must satisfy the equation

$y_i = m x_i + c$

This equation corresponds to a straight line in the m-c space (the
parameter space). Thus, the set of boundary points in the image plane
corresponds to a set of lines in the m-c plane. If two boundary points
are on a line AB in the image plane with parameters $(m_1, c_1)$, then
the two lines corresponding to these two points in the m-c plane
intersect at the point $(m_1, c_1)$. In fact, all boundary points in
the image plane on the same line AB map to lines in the m-c plane that
intersect at the point $(m_1, c_1)$. Thus, the problem of finding the
set of lines in the image plane is reduced to that of finding common
points of intersection of lines in the parameter space. A better
parameterization of a straight line is suggested by Duda [Duda 72], in
which the parameters $\theta$-$\rho$ are used, where $\theta$ is the
angle of the line normal and $\rho$ is the algebraic distance from the
origin. The advantage of this parameterization is that the values of
$\theta$ and $\rho$ are bounded, while in the case of the m-c
parameterization the values are not bounded. The Hough transform can
be extended to detect other curves of analytical parameters [Ball 75],
or to detect general curve shapes using edge orientation at the image
points and a reference point [Ball 81]. A memory efficient
implementation of the Hough transform on sequential machines is
described in [Brow 84]. A parallel algorithm based on the Hough
transform for detecting a general curve with specific orientation has
been developed by Merlin et al. [Merl 75].
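The point-line duality of the slope-intercept parameterization can be checked on a small example. The sketch below uses three hypothetical image points chosen to lie on y = 2x + 1; the function names are illustrative, and exact rational arithmetic is used so that the intersection test is not affected by rounding.

```python
from fractions import Fraction


def parameter_line(x, y):
    """Return c as a function of m for all lines through image point (x, y)."""
    return lambda m: y - m * x


def intersection(p, q):
    """Intersect the parameter-space lines of two image points with distinct x."""
    (x1, y1), (x2, y2) = p, q
    m = Fraction(y2 - y1, x2 - x1)
    return m, y1 - m * x1


points = [(1, 3), (2, 5), (4, 9)]          # all lie on y = 2x + 1
m, c = intersection(points[0], points[1])

# Every collinear point's parameter-space line passes through the same (m, c).
all_meet = all(parameter_line(x, y)(m) == c for x, y in points)
```

Here the common intersection point in the m-c plane recovers the slope and intercept of the image line, which is the property the accumulator-based implementations below rely on.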

The implementation of the Hough transform for detecting straight lines
on a sequential machine involves a quantization of the parameter plane
into a quadruled grid. The grid size is determined by the acceptable
errors in the parameter values, and the quantization is confined to a
specific region of the parameter plane determined by the range of
parameter values. A two-dimensional array (the accumulator array) is
then used to represent the parameter plane grid, where each array
entry corresponds to a grid cell. For each boundary point, the
algorithm on a sequential machine increments the counts in all
accumulator array entries that correspond to grid cells along the
straight line in the parameter plane. After this step, grid cells
corresponding to the accumulator array entries where the count exceeds
a certain threshold value are selected as the set of parameters for
the image straight lines being sought. The increment of accumulator
array counts can be thought of as a process of "voting" by the
boundary points for the parameter values of possible curves passing
through these points. The time required to execute this algorithm on a
sequential machine is proportional to the size s of the grid, plus the
number m of boundary points times the number of votes v of each point
(O(s + mv)). Memory space required is proportional to the size of the
grid.

In what follows we describe two parallel algorithms to implement the
Hough transform on NON-VON. The first one is a direct parallel
implementation of the standard sequential algorithm. The disadvantages
of this approach are presented, and we describe a second approach that
solves these problems. We assume that the boundary points have been
detected by some other procedures and that the PE's holding them are
marked using a special flag. For the sake of simplicity, we also
assume that the curves being sought are straight lines whose equations
are expressed using the slope and intercept parameters.
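For reference, the standard sequential accumulator algorithm described above can be sketched as follows. This is a minimal illustration, assuming integer image coordinates and an integer-valued (m, c) grid so that each vote falls exactly on a grid cell; a practical implementation would round into quantized bins instead. The function name and sample data are hypothetical.

```python
def hough_lines(points, m_values, c_values, thresh):
    """Sequential slope-intercept Hough transform on a quantized (m, c) grid."""
    # Accumulator array: one count per grid cell, O(s) to initialize.
    acc = {(m, c): 0 for m in m_values for c in c_values}
    c_set = set(c_values)
    for x, y in points:
        # Each point votes once per quantized slope, along c = y - m*x.
        for m in m_values:
            c = y - m * x
            if c in c_set:
                acc[(m, c)] += 1
    # Final scan over the grid, also O(s).
    return sorted(cell for cell, count in acc.items() if count > thresh)


sample = [(0, 1), (1, 3), (2, 5), (3, 7), (1, 0)]   # four points on y = 2x + 1
lines = hough_lines(sample, range(-3, 4), range(-10, 11), thresh=3)
```

The two grid passes and the per-point voting loop correspond to the O(s + mv) time bound quoted above, with v equal to the number of quantized slopes.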

3.1  The  NON-VON  Hough  Transform 
Algorithm  -  A  Direct  Approach 

In the sequential machine implementation of the Hough transform, each
boundary point casts its votes in the accumulator array by
incrementing all the entries corresponding to grid cells along the
parameter space curve associated with this point. This process is
repeated for all image boundary points. Next, accumulator array
entries whose count exceeds a specified threshold value are selected.
We now describe how this algorithm is implemented on NON-VON. Each
NON-VON PE is associated with a grid cell in the parameter space. This
can be performed by a procedure similar to the address initialization
procedure described in [Ibra 84a]. In all those PE's, a vector integer
variable COUNT is initialized to zero. The coordinates of boundary
points are then reported to the CP one point at a time using the
RESOLVE instruction. The coordinates of each reported point are then
broadcast to all PE's, and all those PE's holding a grid cell on the
curve in the parameter space corresponding to the reported point
increment the vector variable COUNT by one. This step is performed by
substituting the broadcast values into the parameter space curve
equation; if the equation is satisfied, COUNT is incremented. Each PE
whose COUNT variable exceeds the threshold value is marked, and the
value of the grid cell associated with it is reported to the CP using
the RESOLVE and REPORT instructions. A vector character variable HT is
used to flag those boundary points that have not voted yet. The
NV-PASCAL algorithm that describes the procedure follows.


procedure hough1(thresh: integer);
var
    x, y, m, c: integer;
vector_var
    COUNT, X, Y: integer;
    PARAMETER: boolean;
begin

    /* 1. Initialize the vector variable COUNT in all PE's. The other
       vector variables M, C, and HT are assumed to be defined and
       initialized by the calling procedure. */

    COUNT = 0;
    PARAMETER = 'N';

    /* 2. Enable all PE's in which the boundary points have not been
       reported yet. Report the coordinates of a single boundary point
       using the RESOLVE instruction, and mark this point as reported.
       If none is enabled then all the boundary points have been
       reported. In this case start computing the parameter values
       using the threshold value. */

step2:
    where HT = 'Y' do N-A1 = 1
    elsewhere N-A1 = 0;
    if N-RESOLVE() = 0 then
        go to step4;
    where N-A1 = 1 do
    begin
        HT = 'N';
        N-REPORT8(XADD, x);
        N-REPORT8(YADD, y);
    end;

    /* 3. Enable all PE's holding the grid cells. Broadcast the
       reported image point value. Substitute this value in the
       equation of the parameter space curve. Increment COUNT in all
       PE's in which the equation is satisfied. Now loop to select
       another boundary point. */

    X = x;
    Y = y;
    if Y = (M * X + C) then
        COUNT = COUNT + 1;
    go to step2;

    /* 4. Broadcast the threshold value thresh. Mark all PE's in which
       COUNT > thresh. */

step4:
    where COUNT > thresh do
        PARAMETER = 'Y';

    /* 5. Report the grid cell values held by these enabled PE's one
       by one using the RESOLVE instruction. */

step5:
    where PARAMETER = 'Y' do N-A1 = 1
    elsewhere N-A1 = 0;
    if N-RESOLVE() = 0 then return();
    where N-A1 = 1 do
    begin
        PARAMETER = 'N';
        N-REPORT8(M, m);
        N-REPORT8(C, c);
        go to step5;
    end;

end

Steps 2 through 4 are executed a number of times equal to the number
of boundary points b. Step 5 executes a number of times equal to the
number of curves found, which is less than b. Thus, the algorithm
takes time proportional to the number of image boundary points (O(b)).
The NON-VON 3 code for this procedure [Ibra 84c] executes about 200
instructions for Steps 2 through 4 (50 μsec at 4 MHz). Of these 200
NON-VON 3 instructions, approximately 160 implement the evaluation of
the straight line equation. Step 5 executes about 12 NON-VON 3
instructions for each set of parameter values found. Thus, if the
image contains 1000 boundary points, the execution time of the
algorithm is approximately 53 msec. The number of PE's required by
this approach is equal to the number of grid points. If the grid size
is larger than the machine size by a factor of k, then the parameter
space is divided into k parts. The above procedure is then executed
for each of these parts. The time required to execute the algorithm in
this case is O(kb).
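The control flow of the direct approach can be mimicked sequentially to make the cost structure visible. In this sketch each list index plays the role of one PE holding one (m, c) grid cell; the inner loop stands in for a test that the machine performs in all PE's at once, and popping from `pending` stands in for RESOLVE and REPORT. The function name, the integer grid, and the sample data are illustrative assumptions.

```python
def direct_hough(points, cells, thresh):
    """Simulate the direct approach: one PE per (m, c) grid cell."""
    count = [0] * len(cells)               # vector variable COUNT
    pending = list(points)                 # points whose HT flag is still 'Y'
    broadcasts = 0
    while pending:                         # one pass per boundary point
        x, y = pending.pop(0)              # stand-in for RESOLVE + REPORT
        broadcasts += 1
        for pe, (m, c) in enumerate(cells):    # conceptually simultaneous
            if y == m * x + c:             # curve equation evaluated in every PE
                count[pe] += 1
    found = [cells[pe] for pe in range(len(cells)) if count[pe] > thresh]
    return found, broadcasts


grid = [(m, c) for m in range(-3, 4) for c in range(-10, 11)]
sample = [(0, 1), (1, 3), (2, 5), (3, 7), (1, 0)]
found, broadcasts = direct_hough(sample, grid, thresh=3)
```

The number of broadcasts equals the number of boundary points, which is the source of the O(b) behavior, while the number of PE's equals the grid size regardless of how many cells ever receive a vote.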

One disadvantage of this approach is that it requires a NON-VON
machine of size comparable to the grid size, despite the fact that
many of the PE's will never get their COUNT incremented. Note also
that each time a boundary point is broadcast the curve equation has to
be evaluated in each PE, which is a time consuming operation, as is
clear from the numbers cited earlier. The second approach we describe
below solves these problems. It uses a number of PE's equal to the
number of votes cast by the boundary points, rather than the grid
size, and the curve equation is evaluated only once.

3.2  The  NON-VON  Hough  Transform 
Algorithm  -  An  MSIMD  Approach 

In our improved approach, the NON-VON tree is treated as if it were an
independent set of subtrees, and each boundary point casts its votes
one by one in one of these subtrees. This voting process is performed
concurrently in all the subtrees. Thus, in time proportional to the
number of votes cast by each boundary point, all votes are cast and
stored throughout the tree. The problem of finding the parameter
values which exceed the threshold value is equivalent to that of
finding the local peaks of a two-dimensional histogram. Because of the
way the votes are cast in this second approach, we refer to this
algorithm as a multiple-SIMD (MSIMD) algorithm.

The size of these subtrees is determined by the number of votes cast
by each boundary point. For example, if each boundary point casts 60
votes, then the subtree size required is at least 60. (A subtree of
six levels will suffice.) Boundary points are stored in the roots of
these subtrees. This can be performed by more than one method. The
simplest one is to report the boundary points to the CP one by one
using the RESOLVE instruction, and then to broadcast them to be stored
in the roots of the subtrees. There are other methods to perform this
process more efficiently, but they are beyond the scope of this paper.
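The six-level figure follows from the node count of a complete binary subtree, which has 2^h - 1 PE's for h levels. A small sketch of that calculation, with a hypothetical helper name:

```python
def subtree_levels(votes):
    """Smallest number of levels h with 2**h - 1 >= votes (complete binary subtree)."""
    levels = 0
    while (1 << levels) - 1 < votes:
        levels += 1
    return levels


levels_for_60 = subtree_levels(60)
pes_available = (1 << levels_for_60) - 1
```

With 60 votes per boundary point, five levels give only 31 PE's, so six levels (63 PE's) is the smallest subtree that holds one vote per PE.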

The PE's in these subtrees are enumerated in such a way that each PE
in a subtree is assigned a unique address (stored in the local
variable ADDRESS) in the range [0, max_num_votes], where max_num_votes
is the maximum number of votes cast by each point. The enumeration is
performed in such a way that all PE's in the same relative position
within these subtrees have the same address. This enumeration
procedure is very simple, and will not be described in this paper.
Time required by this procedure is proportional to the height of the
subtree. We now describe the algorithm for storing the votes in the
NON-VON tree.

We assume that the boundary points reside in the roots of the
subtrees, with the PE's being enumerated as described before. We also
assume that the parameter space is a two-dimensional one. The vector
integer variables X and Y are used to store the value of the boundary
points, while the vector variables M1 and M2 are used to store the
parameter values. A scalar variable g_m1 stores the value of parameter
M1 to be broadcast, and the scalar constant delta_m1 is the increment
used to change the value of g_m1. The scalar constant h_subtree is the
height of the subtree. The NV-PASCAL voting procedure follows.

procedure hough2;
var
    i, j, g_m1: integer;
vector_var
    M1, M2, X, Y: integer;
begin

    /* 1. Initialize the scalar variables. The scalar variables
       h_subtree, delta_m1, max_num_votes are initialized by the
       calling procedure. */

    i = 0;
    g_m1 = 0;

    /* 2. Enable all PE's that are not the root of some voting
       subtree. The variable SUBTREE_ROOT is assumed to be set by the
       calling procedure. Set X and Y in each child equal to X and Y
       in its parent. Repeat this step h_subtree times. */

    where SUBTREE_ROOT = 'N' do
    begin
        N-RECV8(P, XADD, X);
        N-RECV8(P, YADD, Y);
        for j = 1 to h_subtree-1 do
        begin
            N-RECV8(P, X, X);
            N-RECV8(P, Y, Y);
        end;
    end;

    /* 3. Enable the PE's with ADDRESS equal to i in voting subtrees.
       In all the enabled PE's store the new value of the parameter
       M1. */

step3:
    where ADDRESS = i do
        M1 = g_m1;

    /* 4. Increment g_m1 by delta_m1, and increment i by 1. If all the
       values of M1 have been stored in the voting PE's, then proceed
       to compute the value of M2 in those PE's, otherwise repeat
       step 3. */

    i = i + 1;
    g_m1 = g_m1 + delta_m1;
    if i < max_num_votes then
        go to step3;

    /* 5. Enable all PE's. Using the values of X, Y, and M1, compute
       M2 such that the curve equation is satisfied. */

    M2 = compute_m2(X, Y, M1);

end

Step 2 is executed a number of times equal to the subtree height
(log v), where v is equal to the number of votes cast by each point.
Steps 3 and 4 are executed a number of times equal to v. Thus, the
procedure to store the votes in the subtree takes time of O(v). Note
that step 5, the evaluation of the curve equation, is executed only
one time. If the evaluation of the curve equation results in more than
one M2 value for each value of M1, then each PE stores more than one
set of parameter values. This case depends on the parameter space
curve, and should result in a slightly modified version of the
algorithm to compute the local peaks of the parameter histogram
described later.
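The voting procedure can be followed on a sequential machine with one list per subtree. The sketch below is an illustration under stated assumptions: each subtree is flattened to a list indexed by ADDRESS, the copy of X and Y down the subtree (step 2) is folded into list construction, and compute_m2 is taken to be the slope-intercept relation c = y - m*x, which the paper leaves unspecified. Loops over subtrees stand in for operations that run in all subtrees concurrently.

```python
def cast_votes(boundary_points, max_num_votes, delta_m1):
    """Simulate hough2: every subtree casts all votes of its boundary point."""
    # One subtree per boundary point; the PE at index `address` holds one vote.
    subtrees = [
        [{"X": x, "Y": y, "M1": None, "M2": None} for _ in range(max_num_votes)]
        for x, y in boundary_points          # step 2: X, Y copied down from the root
    ]
    g_m1 = 0
    for i in range(max_num_votes):           # steps 3-4: v sequential broadcasts
        for subtree in subtrees:             # conceptually simultaneous
            subtree[i]["M1"] = g_m1
        g_m1 += delta_m1
    for subtree in subtrees:                 # step 5: evaluated once, in parallel
        for pe in subtree:
            pe["M2"] = pe["Y"] - pe["M1"] * pe["X"]   # assumed form: c = y - m*x
    return [(pe["M1"], pe["M2"]) for subtree in subtrees for pe in subtree]


votes = cast_votes([(1, 3), (2, 5)], max_num_votes=4, delta_m1=1)
```

The only sequential loop whose length matters on the machine is the broadcast loop of v iterations, and the curve equation is evaluated in a single parallel step after all M1 values are in place.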


The NON-VON 3 code for this procedure executes approximately
(10v + 160) NON-VON 3 instructions. For v equal to 100, the time
required to cast the votes in the tree is thus about 0.3 msec. If
there are more votes than the NON-VON tree size, each PE stores more
than one vote. In this case, if each PE stores k values, then the time
required to execute the above procedure is O(kv), where k is the ratio
between the total number of votes and the NON-VON tree size.
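The timing figures quoted for both approaches follow from the instruction counts if one instruction is assumed to complete per clock cycle at 4 MHz, which is consistent with the statement that 200 instructions take 50 μsec. A small sketch of the arithmetic, with hypothetical helper names:

```python
CLOCK_HZ = 4_000_000          # 4 MHz; 200 instructions take 50 usec in the text


def msec(instructions):
    """Convert an instruction count to milliseconds at one instruction per cycle."""
    return instructions / CLOCK_HZ * 1e3


def voting_msec(v):
    """MSIMD voting procedure: about 10v + 160 instructions."""
    return msec(10 * v + 160)


def direct_msec(b, parameter_sets):
    """Direct approach: about 200 per boundary point plus 12 per reported set."""
    return msec(200 * b + 12 * parameter_sets)
```

Under this assumption, voting with v = 100 takes 0.29 msec, and the direct approach spends 50 msec on 1000 boundary points before any parameter sets are reported.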

Next, we describe how to find the parameter values that have votes
exceeding the threshold value. These values occur at the local peaks
of the two-dimensional histogram of the votes for M1 and M2. We assume
in the following discussion that there are few of these local peaks.
This is a realistic assumption, as the number of curves being sought
is usually small. Figure 2 shows such a histogram. In this example,
there are a few areas of voting activity (local peaks). A direct
approach to the identification of these local peaks involves dividing
the two-dimensional histogram space into grid cells. For each grid
cell, all PE's with M1 and M2 values falling within this grid cell are
then marked and counted. The time required to execute this simple
procedure is O(sh), where s is the grid size and h is the NON-VON tree
height. Counts that exceed the threshold value are the parameter
values being sought. A large percentage of the time in this procedure
is spent counting votes in grid cells corresponding to areas that
contain few votes.

We are currently simulating a different approach, in which areas of no
voting activity are not considered in locating the local peaks of the
two-dimensional histogram. The procedure first computes a
one-dimensional histogram of the parameter M2, as shown in Figure 2.
(A pipelined-SIMD algorithm to compute the one-dimensional histogram
is described in [Ibra 84c].) A small number of local peaks,
corresponding to regions of the two-dimensional histogram where most
of the votes occur, appear in the one-dimensional histogram. Only
votes in those regions are then marked. A second one-dimensional
histogram of the parameter M1 is then computed for the marked votes
only. The local peaks of this histogram are the values of M1 for which
there are local peaks in the two-dimensional histogram. The values of
M1 and M2 for which local peaks exist in the two one-dimensional
histograms mark the regions of activity in the two-dimensional
histogram. These regions of active voting are then checked for exact
vote counts. Round off errors in computing the parameter values can
result in peaks that are relatively flat. For this reason, a small
window around the regions of voting activity should also be checked
when counting the exact votes. This second approach executes in time
of O(m1 + m2 + h), where m1 and m2 are the number of bins in the two
one-dimensional histograms. The computation of a 64-bin
one-dimensional histogram requires about one msec. The algorithm for
locating the local peaks in the two-dimensional histogram of the
parameter values as described earlier executes in about 5 msec. The
total execution time of the second approach is thus about 5.3 msec,
which is considerably less than the time required by the first
approach (50 msec for 1000 boundary points).
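The two-pass peak search can be sketched on a sequential machine as follows. This is a simplified illustration, not the pipelined-SIMD histogram algorithm of [Ibra 84c]: it treats each distinct parameter value as one bin, uses the vote threshold itself to decide which one-dimensional bins are active, and omits the small window recommended above for round-off errors. The function name and sample votes are hypothetical.

```python
from collections import Counter


def find_peaks(votes, thresh):
    """Locate (M1, M2) cells with more than `thresh` votes via two 1-D histograms."""
    # Pass 1: histogram of M2 over all votes; keep bins that could hold a peak.
    hist_m2 = Counter(m2 for _, m2 in votes)
    active_m2 = {m2 for m2, n in hist_m2.items() if n > thresh}
    marked = [(m1, m2) for m1, m2 in votes if m2 in active_m2]
    # Pass 2: histogram of M1 over the marked votes only.
    hist_m1 = Counter(m1 for m1, _ in marked)
    active_m1 = {m1 for m1, n in hist_m1.items() if n > thresh}
    # Exact counts are needed only in the few candidate regions.
    exact = Counter(v for v in marked if v[0] in active_m1)
    return sorted(cell for cell, n in exact.items() if n > thresh)


sample_votes = [(2, 1)] * 4 + [(0, 3), (1, 2), (3, 0), (0, 5), (1, 3), (3, -1)]
peaks = find_peaks(sample_votes, thresh=3)
```

Any cell with more than `thresh` votes necessarily contributes more than `thresh` votes to its M2 bin and to its M1 bin among the marked votes, so no true peak is lost by the two filtering passes.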

The algorithms described here can be extended using slight
modifications to deal with parameter spaces of higher dimensions. For
example, in the first approach, if we have an n-dimensional parameter
space, then each PE will correspond to an n-dimensional grid cell in
this space. In the second approach, the subtree size will correspond
to that of an (n-1)-dimensional area of the parameter space, and each
PE will store parameter values that represent cells in this
subparameter space. A second approach to extend the Hough transform to
parameter spaces of higher dimensions involves applying the current
algorithms to two-dimensional cross sections of the multi-dimensional
parameter space.

4  Conclusion 

In this paper, we have addressed the problem of implementing the Hough
transform method on fine-grained tree-structured SIMD machines. Two
algorithms have been developed for implementing the Hough transform
method on the NON-VON machine. The first one is a parallel
implementation of the standard sequential machine algorithm. The
second algorithm incorporates novel approaches to exploit the tree
organization of the machine, and it executes 10 times faster than the
first algorithm.
Abstract

Through regularization analysis, ill-posed visual reconstruction
problems can be restated as well-posed variational principles whose
stabilizing functionals impose smoothness constraints on possible
solutions. Encounters with discontinuities present serious
difficulties to standard regularization methods, however, since the
smoothness properties of stabilizing functionals must be regulated
appropriately near known singularities. After examining the close
connections between regularization and generalized spline
approximation, we propose a class of controlled-smoothness stabilizers
which overcome potential difficulties with discontinuities.


1.  Introduction 

1.1.  Reconstruction,  Splines,  &  Regularization 

Early vision can be described as the reconstruction of physically-
based intrinsic scene characteristics from images [Barrow and
Tenenbaum, 1978; Marr, 1982]. Included among early visual
reconstruction problems are the recovery of lightness from image
irradiance, motion fields from time-varying images, and disparity
fields from binocular images. An intermediate reconstruction problem
concerns the integration of this information with constraints derived
from shading, texture, and contours to compute visible-surface
representations [Terzopoulos, 1984].

Extending the theoretical aspects of our work on surface
reconstruction, we have been able to unify a broad range of
visual reconstruction problems in terms of well-posed (quadratic)
variational principles in Sobolev spaces (see [Terzopoulos, 1982],
section 6). The variational principles define constrained spline
approximation problems which involve a particular class of generalized
multidimensional spline functionals whose properties have been
studied extensively by Duchon [1976, 1977], Meinguet [1979], and
others.

As is generally the case for inverse mathematical problems,
visual reconstruction problems tend to be ill-posed, in that existence,
uniqueness, and stability of solutions cannot be guaranteed in the
absence of additional constraints. This important observation was
made by Poggio and Torre [1984], who further point out that
systematic approaches to the solution of ill-posed problems can
be exploited in early vision. They investigate in this context the
regularization methods introduced by Tikhonov [1963] and others
(see [Tikhonov and Arsenin, 1977] and references therein).

Through regularization analysis, ill-posed visual problems can
be restated as well-posed variational principles by the introduction of
appropriate stabilizing functionals, notably the stabilizers suggested
by Tikhonov [Tikhonov and Arsenin, 1977, pp. 69-70]. Tikhonov's
stabilizers can be viewed as spline functionals that impose smoothness
constraints on the admissible solutions (typically restricting them to
Sobolev spaces of smooth functions). Pragmatically then, generalized
spline approximation turns out to be essentially equivalent to
regularization analysis. By exploring this relationship in some detail,
our earlier work can be shown to be fully compatible with Poggio
and Torre's promising new regularization framework for early vision
(also [Poggio, et al., 1984]).

1.2.  Discontinuities  &  Regularization 

Smoothness constraints are applicable to the regularization of
visual reconstruction problems inasmuch as the coherence of matter
tends to produce smooth surfaces relative to the viewing distance over
some range of scales. An inevitable complication arises, however, due
to the necessity of dealing with discontinuities in the intrinsic
properties of scenes. Discontinuities result from significant,
spatially localized physical changes in the world, such as abrupt
changes in surface geometry (e.g. occlusions), abrupt alterations of
surface composition (e.g. texture), or abrupt transitions of
illumination (e.g. shadows). Discontinuities tend to be spatially
organized along contours in the image, especially when they are due to
surface geometry changes. Some discontinuity contours can persist
across all scales.

Smoothness assumptions clearly do not hold indiscriminately
across discontinuity contours. In this paper, we propose a way to
manipulate the smoothness properties of multidimensional Tikhonov
stabilizers over the spatial domain. We show that smoothness can be
regulated locally to preserve discontinuities, through the specification
of a set of parametric functions provided by the stabilizer.

1.3.  Ill-Posed  Nature  of  Visible-Surface  Reconstruction 

The basic ideas derive from a formal treatment of surface
discontinuities in the context of visible-surface reconstruction
[Terzopoulos, 1983b, 1984]. The task is to reconstruct dense
representations of the shapes of visible surfaces from initial
measurements of surface depth and orientation (as well as their
discontinuities). Such initial measurements are provided by numerous
specialized low-level visual processes.

Let the true distance at each point (x, y) in the image from
the viewer to visible surfaces be given by the function Z(x, y). This
function is often referred to as the depth map of the scene. In
general, low-level visual modules generate initial data of the form

$d_i = L_i[Z(x, y)] + \epsilon_i$

where $L_i$ denote measurement functionals of $Z(x, y)$ and
$\epsilon_i$ denote associated measurement errors. The simplest
measurement functionals are evaluation functionals, which provide
local depth data of the form

$L_i[Z(x, y)] = Z(x_i, y_i)$

and derivative functionals, which provide the local gradient data

$L_i[Z(x, y)] = Z_x(x_i, y_i) = p_{(x_i, y_i)}$

hence, the local surface normal
$n(x, y) = (Z_x(x, y), Z_y(x, y), -1)$. Other functionals such as
directional derivatives may also play a role, and they can be
accommodated straightforwardly.

Visible-surface reconstruction is a mathematically ill-posed visual
problem, as suggested (informally) by the following considerations.
First, the initial measurements are contributed not by one, but
by multiple specialized low-level visual processes. Hence slightly
inconsistent measurements (provided by different low-level processes)
that happen to coincide will locally overdetermine surface shape.
Second, the measurements are not dense, but scattered sparsely over
the visual field. Thus, while they constrain surface shape locally, they
do not determine it uniquely everywhere. Third, the measurements
are not precise, but subject to errors and noise. Indeed, additive noise
of high enough frequency, regardless of how small its RMS amplitude,
can locally perturb surface orientation radically.

Visible-surface reconstruction is thus a fundamentally ill-posed
problem, since according to the above three considerations
(respectively) we cannot conclude in general that the solution will
exist, nor that it will be unique, nor that it will be stable with respect
to perturbations in the data.

Given its ill-posed nature, this visual problem can be
approached via regularization methods. In this context, the thin plate
surface under tension, introduced in [Terzopoulos, 1984] as a physical
model for the reconstructed surface, can be readily interpreted as
a controlled-smoothness stabilizer. We will explain shortly how this
special case of generalized controlled-smoothness splines enables the
surface reconstruction problem to be regularized, even in the presence
of discontinuities.

2.  Optimal  Approximation  with  Generalized  Splines 

The abstract theory of optimal spline approximation is well-developed
and a close connection has been established with variational
principles involving the constrained minimization of semi-norms in
Hilbert function spaces [Laurent, 1972]. This connection leads to
natural multidimensional generalizations of the classical univariate
splines [Ahlberg et al., 1967] (not as tensor products, but rather through
physical interpretations concerning equilibria of elastic bodies). In
this section, we consider a class of generalized spline approximation
problems and indicate their close relationship to regularization analysis.

Let $\mathcal{H}$ be a linear space of smooth functions and let $S$ be a
functional defined on $\mathcal{H}$ which measures the smoothness of a function
in $\mathcal{H}$. Furthermore, let $C$ be a functional on $\mathcal{H}$ which provides a
measure of the incompatibility between the function and given data.
Consider the following variational principle:

VP: Find $u \in \mathcal{H}$ such that

$$\mathcal{E}(u) = \inf_{v \in \mathcal{H}} \mathcal{E}(v),$$

where

$$\mathcal{E}(v) = S(v) + C(v).$$

This defines an approximation problem: to find the smoothest
admissible function in the space $\mathcal{H}$ which is most compatible with
the data.

Assuming independently distributed measurement errors $\epsilon_i$ with
zero means and variances $\sigma_i^2$, the natural incompatibility measure is a
weighted Euclidean norm of the discrepancy between the admissible
function and the data $d_i$. This can be written as

$$C(v) = \sum_i w_i \left( L_i[v] - d_i \right)^2,$$

where the $w_i$ are nonnegative real-valued weights. The basic
measurement functionals of immediate concern are generalized
derivative functionals of the form

$$L_i[v] = \left. \frac{\partial^k v}{\partial x_{j_1} \cdots \partial x_{j_k}} \right|_{\mathbf{x}_i}.$$

Note that for $k = 0$, $L_i$ reduces to the evaluation functional $L_i[v] = v(\mathbf{x}_i)$.

2.1.  Generalized  Spline  Functionals 

Duchon [1976, 1977] and Meinguet [1979] study the following
class of generalized smoothness functionals, defined for d-dimensional
functions $v(\mathbf{x})$, $\mathbf{x} = [x_1, \ldots, x_d]$:

$$S_m(v) = \int_{\mathbb{R}^d} \sum_{i_1, \ldots, i_m = 1}^{d} \left( \frac{\partial^m v}{\partial x_{i_1} \cdots \partial x_{i_m}} \right)^2 d\mathbf{x}.$$

These functionals have several interesting properties [Meinguet,
1979]. Since the variational principle reduces in the one-dimensional
case ($d = 1$) to the approximation problem associated with classical
smoothing splines [Schoenberg, 1964; Reinsch, 1967], the functionals
$S_m(v)$ can be thought of as generating multivariate generalized splines.
The positive integer $m$ dictates the order of the partial derivatives
that occur in the functional, which in turn determines the order of
continuity that the admissible functions $v$ must possess.

In fact, the functionals define the natural semi-norms of Beppo-Levi
spaces (which are related to Sobolev spaces). These semi-norms
are invariant under translation, rotation, and similarity transformation,
invariance properties which prove essential in the context of visual
reconstruction problems (since the solutions to visual reconstruction
problems should not change shape when objects in the scene translate
or rotate parallel to the image plane, or when they approach or retreat
parallel to the view direction [Brady and Horn, 1983; Terzopoulos,
1982]). The null-spaces $\mathcal{N}$ of the semi-norms are simply the
$M = \binom{d+m-1}{d}$ dimensional spaces of all polynomials over $\mathbb{R}^d$ of degree less
than or equal to $m - 1$ [Schwartz, 1966, p. 60].

Under certain conditions $\mathcal{E}(v)$ becomes a norm in $\mathcal{H}$, which
guarantees existence, uniqueness, and stability of the solution $u$ to
the variational principle. A possible set of conditions is that the $L_i$
include evaluation functionals at an $\mathcal{N}$-unisolvent set of points (i.e., a
set of $M$ points which define a unique polynomial in the null space
of the smoothness functional).
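As a small illustration of the counting argument above, the following Python sketch computes the null-space dimension $M$ and checks unisolvency in the thin plate case ($d = 2$, $m = 2$), where the null space consists of planes. The function names are illustrative and do not come from the paper, and the collinearity test is the standard determinant criterion for that one case only.

```python
from math import comb


def null_space_dimension(d: int, m: int) -> int:
    """Number of polynomials of degree <= m - 1 in d variables: C(d + m - 1, d)."""
    return comb(d + m - 1, d)


def is_unisolvent_for_planes(points, tol=1e-12):
    """Unisolvency test for d = 2, m = 2 only (an assumption of this sketch).

    The null space is spanned by {1, x, y}. Three points determine a unique
    plane exactly when they are not collinear, i.e. when the signed area of
    the triangle they form is nonzero.
    """
    (x0, y0), (x1, y1), (x2, y2) = points
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    return abs(det) > tol
```

For the thin plate case, `null_space_dimension(2, 2)` gives three, so depth values at three non-collinear image points are enough to remove the planar ambiguity left by the smoothness functional.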

Several important visual reconstruction problems can be cast as
spline approximation problems according to the foregoing variational
formulation [Terzopoulos, 1982, 1984]. An equivalent point of view
is that the ill-posed reconstruction problems are regularized by
introducing a stabilizing functional which restricts admissible solutions
to the class of generalized splines.


2.2.  Multidimensional  Tikhonov  Stabilizers 


In fact, it is readily possible to make an explicit connection
between Tikhonov's stabilizers and the generalized spline functionals.
Tikhonov's stabilizer of p-th order is the weighted Sobolev norm

$$\|v\|_p^2 = \sum_{m=0}^{p} \int_{\mathbb{R}} \lambda_m(x) \left( \frac{d^m v}{dx^m} \right)^2 dx,$$

where the $\lambda_m(x)$ are given nonnegative continuous weighting functions
[Tikhonov and Arsenin, 1977, pp. 69-70; Poggio and Torre, eq. 5].
The natural multidimensional generalization,

$$\|v\|_p^2 = \sum_{m=0}^{p} \int_{\mathbb{R}^d} \lambda_m(\mathbf{x}) \sum_{i_1, \ldots, i_m = 1}^{d} \left( \frac{\partial^m v}{\partial x_{i_1} \cdots \partial x_{i_m}} \right)^2 d\mathbf{x},$$

can be identified as a weighted summation of generalized spline
functionals. If the smoothness functional $S(v)$ in the variational
principle is taken to be the Tikhonov stabilizer, the admissible space
$\mathcal{H}$ becomes the p-th order Sobolev space.


3.  Surface  Approximation  with  Thin  Plate  Splines 


In this section, we restrict our discussion of generalized
approximation/regularization problems to two dimensions. Note that
in two dimensions ($d = 2$) the generalized spline functionals can be
written as

$$S_m(v) = \iint_{\mathbb{R}^2} \sum_{i=0}^{m} \binom{m}{i} \left( \frac{\partial^m v}{\partial x^i \, \partial y^{m-i}} \right)^2 dx\,dy.$$

Here they pertain directly to the problem of fitting surfaces to scattered
data. This mathematical problem is of considerable concern in many
application areas [Schumaker, 1976], and notably in vision to visible-surface
reconstruction. Of the methods reviewed by Schumaker,
the spline approximation techniques are of direct interest in our
discussion.


An  appealing  spline  approximation  technique  involves  fitting  to 
given  data  a  surface  function  u(x,y)  which  minimizes  the  functional 

$$S_2(u) = \iint u_{xx}^2 + 2u_{xy}^2 + u_{yy}^2 \, dx\,dy$$

(see [Schumaker, 1976, section 3.4]). The functional will be recognized
as an instance of the two-dimensional generalized splines for the case
$m = 2$. It turns out that $S_2$ measures the (small deflection) bending
energy of a thin plate (with zero Poisson ratio) [Courant and Hilbert,
1953]. Hence, Duchon [1976] refers to the surfaces which minimize $S_2(u)$
as thin plate splines.
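To make the thin plate functional concrete, here is a minimal Python sketch that approximates $S_2(u)$ on a regular grid with central and cross finite differences. The function name, the grid layout (`u[i][j]` with `i` along y and `j` along x), and the particular stencils are assumptions of this sketch, not necessarily the discretization used in the cited work.

```python
def thin_plate_energy(u, h=1.0):
    """Finite-difference approximation of the integral of
    u_xx^2 + 2 u_xy^2 + u_yy^2 over a grid with spacing h.

    u is a list of rows; each stencil is applied only where it fits
    entirely inside the grid.
    """
    rows, cols = len(u), len(u[0])
    energy = 0.0
    # Second difference along x at every node with left and right neighbours.
    for i in range(rows):
        for j in range(1, cols - 1):
            u_xx = (u[i][j + 1] - 2.0 * u[i][j] + u[i][j - 1]) / h**2
            energy += u_xx**2
    # Second difference along y at every node with lower and upper neighbours.
    for i in range(1, rows - 1):
        for j in range(cols):
            u_yy = (u[i + 1][j] - 2.0 * u[i][j] + u[i - 1][j]) / h**2
            energy += u_yy**2
    # Mixed difference on every grid cell.
    for i in range(rows - 1):
        for j in range(cols - 1):
            u_xy = (u[i + 1][j + 1] - u[i + 1][j] - u[i][j + 1] + u[i][j]) / h**2
            energy += 2.0 * u_xy**2
    return energy * h**2  # each term stands for an area element of size h^2
```

Any planar grid gives zero energy under this approximation, which mirrors the statement that the null space of the $m = 2$ functional consists of polynomials of degree at most one.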

Thin plate splines have seen several applications, including
computer design of aircraft wings [Harder and Desmarais, 1972],
automatic generation of digital terrain maps [Briggs, 1974], and
variational analysis of meteorological fields [Wahba and Wendelberger,
1980].

In application to vision, Grimson [1981, 1983] employed $S_2(u)$
for the interpolation of visual surfaces from sparse disparity data
(he referred to it as the "quadratic variation"). The relationship to
thin plates was rediscovered by [Brady and Horn, 1983], and this
physical model was developed further in [Terzopoulos, 1982, 1983a],
where efficient, multiresolution surface reconstruction algorithms are
devised.

3.1.  Discontinuities  &  Thin  Plate  Surfaces  Under  Tension 

The above applications demonstrate the basic usefulness of thin
plate spline surface approximation in the absence of discontinuities.
Deficiencies of thin plate splines near discontinuities are particularly
evident in the vision applications, however. The papers by Grimson and
Terzopoulos contain a number of examples in which the reconstructed
surface smooths inappropriately across occluding boundaries and
creases.

In the vicinity of discontinuities, a thin plate surface attempts
to follow the sudden changes in the data, while it simultaneously
attempts to maintain its characteristic smoothness. It is consequently
forced to overshoot the data points; this causes spurious inflection
points (hence, ripples) near discontinuities. This phenomenon also
manifests itself in classical spline interpolation [Ahlberg et al., 1967].

Schweikert [1966] introduced splines under tension, which imitate
the behavior of cubic interpolating splines, while suppressing the
extraneous inflection points which sometimes afflict cubic splines
(see, also, [Cline, 1974]). Intuitively, tension can eliminate extraneous
loops and ripples by reducing the length of the spline. Nielson
[1974] characterizes splines under tension as interpolatory functions
that minimize the functional $\int (d^2 v/dx^2)^2 \, dx + \alpha^2 \int (dv/dx)^2 \, dx$,
where $\alpha$ is a constant to be chosen, called the tension. The first
term influences smoothness, while the second term influences length.
The basis functions for splines under tension involve exponentials.
Nielson developed $\nu$-splines, piecewise polynomial alternatives, one
of whose advantages is that the tension can be set selectively at each
interpolation point (see, also, [Barsky and Beatty, 1983]).

Pilcher [1974] suggested that the idea of splines under tension
can be carried over to surfaces. His proposal involved restricting the
area of a tensor product of polynomial splines, and he characterized
this as a minimization problem whose solutions yield surfaces which
he described as "elastic skins."

In [Terzopoulos, 1984] we proposed a new model for visible
surfaces: the natural (physical) generalization of splines under
tension to surfaces. Unlike Pilcher's elastic skins, our model involves
no tensor products. The idea is to restrict the area of a thin plate
spline by introducing surface tension.

It is well-known that membrane surfaces, in equilibrium, minimize
the functional

$$S_1(v) = \iint v_x^2 + v_y^2 \, dx\,dy,$$

which is the small deflection approximation of the surface area of $v$
[Courant and Hilbert, 1953] (minimization of the true surface area
leads to the famous Plateau's problem). $S_1(v)$ will be recognized as
another instance of the two-dimensional generalized spline, this time
for $m = 1$. Naturally, the membrane spline exhibits a lower order of
smoothness than the thin plate spline.

Surface area can be restricted by combining the membrane
spline functional $S_1(v)$ with the thin plate spline functional $S_2(v)$. We
propose the convex combination

$$S_\tau(v) = \iint (1 - \tau)\left( v_{xx}^2 + 2v_{xy}^2 + v_{yy}^2 \right) + \tau \left( v_x^2 + v_y^2 \right) dx\,dy,$$

where the real valued parameter $\tau \in [0, 1]$ controls the tension. In view
of the physical interpretation, solutions to the variational principle
involving this functional will be referred to as thin plate surfaces under
tension.
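As a worked step that the text does not spell out, the standard calculus of variations applied to the convex combination of the thin plate and membrane integrands, with the tension held constant and away from any data constraints, gives the following stationarity condition. Here $\Delta$ denotes the Laplacian; constant $\tau$ and the absence of data terms are simplifying assumptions.

```latex
% Functional: \iint (1-\tau)(v_{xx}^2 + 2 v_{xy}^2 + v_{yy}^2) + \tau (v_x^2 + v_y^2) \, dx \, dy
% Euler-Lagrange equation for constant \tau (common factor of 2 dropped):
(1 - \tau)\, \Delta^2 v \;-\; \tau\, \Delta v \;=\; 0,
\qquad
\Delta v = v_{xx} + v_{yy},
\quad
\Delta^2 v = v_{xxxx} + 2 v_{xxyy} + v_{yyyy}.
```

Setting $\tau = 0$ leaves the biharmonic equation of the thin plate, and setting $\tau = 1$ leaves Laplace's equation of the membrane, which matches the two limiting behaviors of the tension parameter.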

As $\tau$ approaches 0, the thin plate surface under tension tends
towards a normal thin plate spline, whereas as $\tau$ approaches 1,
the surface area is increasingly restricted and the surface tends
toward a membrane spline. Intermediate values of $\tau$ characterize a
hybrid surface that amalgamates properties of the thin plate and the
membrane. An appropriate value for the tension parameter can lead
to advantages for surface reconstruction analogous to those offered by
splines under tension over normal splines. For example, overshoots
can be controlled, but at the cost of concentrating the curvature of
the reconstructed surface near the constraints.

4.  Controlled-Smoothness  Stabilizers 

Going one step further, the tension parameter in $S_\tau(v)$ can
be made a nonnegative parametric function of two variables $\tau =
\tau(x, y)$. This allows the tension to be manipulated over the domain
of integration. Away from depth discontinuities, we can assign
$\tau(x, y) = 0$ so that the surface acts as a thin plate. Along orientation
discontinuities, we can assign $\tau(x, y) = 1$ so that the surface acts as
a membrane, thus creasing freely. The smoothness functional $S_\tau(v)$
is deactivated entirely at occluding boundaries, where discontinuities
in depth occur.
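The effect of switching smoothness off at a known depth discontinuity can be sketched in one dimension, which corresponds to a spline under tension rather than a surface. The pure-Python sketch below minimizes a discrete data term plus bending and tension terms by solving the normal equations. The per-node switch `rho` (1 to keep the smoothness terms, 0 to deactivate them), the function names, and the dense-data setup are assumptions made for illustration; this is not the multiresolution algorithm of the cited work.

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def reconstruct_1d(data, weights, rho, tau):
    """Minimize sum_i w_i (v_i - d_i)^2
         + sum_i rho_i (1 - tau_i) (v_{i-1} - 2 v_i + v_{i+1})^2   (bending)
         + sum_i rho_i tau_i (v_{i+1} - v_i)^2                     (tension)
    """
    n = len(data)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n

    def add_term(weight, idx, coef, target=0.0):
        # Accumulate weight * (sum_k coef[k] * v[idx[k]] - target)^2
        # into the normal equations A v = b.
        for ia, ca in zip(idx, coef):
            b[ia] += weight * ca * target
            for ib, cb in zip(idx, coef):
                A[ia][ib] += weight * ca * cb

    for i in range(n):
        add_term(weights[i], [i], [1.0], data[i])
        if 1 <= i <= n - 2:
            add_term(rho[i] * (1.0 - tau[i]), [i - 1, i, i + 1], [1.0, -2.0, 1.0])
        if i <= n - 2:
            add_term(rho[i] * tau[i], [i, i + 1], [-1.0, 1.0])
    return solve_linear(A, b)


# Step data with a depth discontinuity between nodes 4 and 5.
data = [0.0] * 5 + [1.0] * 5
ones, zeros = [1.0] * 10, [0.0] * 10
smooth = reconstruct_1d(data, ones, ones, zeros)   # plate everywhere
rho = ones[:]
rho[4] = rho[5] = 0.0                              # deactivate across the break
broken = reconstruct_1d(data, ones, rho, zeros)    # step is preserved
```

With the smoothness terms active everywhere the step is blurred, whereas zeroing `rho` at the two nodes adjacent to the break decouples the two sides so that each is reconstructed independently.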

Fig.  1  shows  examples  of  discontinuity  preserving  surface 
reconstructions  using  the  thin  plate  surface  under  tension  model. 
Interpreting  the  thin  plate  surface  under  tension  as  a  stabilizing 
functional, we see that it provides local control over its smoothness
properties. Specifically, it enables the reconstructed surfaces to "break"
locally at known depth discontinuities and "crease" locally at known
orientation discontinuities. We will refer to stabilizers of this type as
controlled-smoothness stabilizers.

Controlled-smoothness stabilizers can be generalized beyond
the case of thin plate surfaces under tension. Recall that the order
$m$ of the generalized splines $S_m(v)$ dictates the continuity of the
admissible functions $v$. We mentioned that for the case $m = 1$ the
solution can be characterized as a membrane spline. Physically, the
membrane defines a continuous surface which, however, need not
have continuous first (and higher) order partial derivatives. Similarly,
for the case $m = 2$ the solution can be characterized as a thin plate
spline. This defines a continuous physical surface with continuous
first partial derivatives, which need not have continuous derivatives of
degree greater than one. In general, physical solutions corresponding
to $S_m$ are functions possessing $C^{m-1}$ continuity.

This  suggests  a  convenient  way  of  controlling  the  smoothness 
properties  of  a  generalized  spline  stabilizer  of  order  m  —  by 
combining  it  with  splines  of  orders  less  than  m.  As  we  have 
shown, the multidimensional Tikhonov stabilizers do involve such
compositions and, indeed, there is no reason why the parametric
functions $\lambda_m(\mathbf{x})$ cannot play the same role as the tension $\tau(x, y)$. The
functions can then be used to control the blending proportions of each
component generalized spline $S_m$, thus regulating the smoothness of
the Tikhonov stabilizer.

Both the spline under tension and the thin plate surface under
tension are special cases of the generalized controlled-smoothness
stabilizers, for the cases $d = 1$, $p = 2$, and $d = 2$, $p = 2$ respectively.
As a further example, consider the case $d = 2$, $p = 3$. If all the
$\lambda_m(x, y)$ are nonnegative functions, the solution specifies a surface
spline which is constrained to have continuous second order partial
derivatives (i.e. continuous curvature). However, it is possible to
nullify this smoothness constraint by setting $\lambda_3(x, y) = 0$ at known
curvature discontinuities in the $x, y$ plane. This effectively reduces
the functional to a thin plate spline locally, which need not have
continuous curvature. Analogously, $\lambda_3(x, y)$ and $\lambda_2(x, y)$ can both be
set to zero at known orientation discontinuities. Finally, all three $\lambda$
functions can be set to zero at known depth discontinuities.

Figure 1. Discontinuity preserving surface reconstructions with the thin plate
surface under tension. (Top) Reconstruction from scattered synthetic depth
data with depth discontinuities. (Center) Reconstruction from synthetic
orientation  data  showing  depth  and  orientation  discontinuities.  (Bottom) 
Reconstruction  of  a  light  bulb  from  structured  light  data  showing  depth 
discontinuities. 


4.1.  Regularization  Analysis  &  Discontinuity  Detection 

Controlled-smoothness stabilizers are applicable to ill-posed
visual reconstruction problems in which the locations and orders of
discontinuities in the solutions are known or can be determined in
advance. An interesting question arises: Is it possible to determine
the location of discontinuities as an integral part of regularization
analysis? We offer some preliminary ideas on how to do this in view
of the development in this paper.


The detection of discontinuities can be incorporated formally
into a variational principle whose stabilizing functional $S(v)$ is a
controlled-smoothness stabilizer. The basic problem of finding the
optimal solution $u$ is then augmented with the additional requirement
to find $p + 1$ optimal parametric functions $\lambda_m(\mathbf{x})$, $m = 0, 1, \ldots, p$,
which determine the locations of discontinuities up to order $p$.

This larger problem is made more difficult by the fact that
it is non-convex. One way to approach it is by solving a chain of
easier convex problems, where the (improving) solution at each stage
becomes data for the subsequent problem. This is essentially the
discontinuity detection approach implemented in [Terzopoulos, 1984].
Blake [1983] describes an alternative graduated non-convexity method
for finding discontinuities in images. He has implemented it for
the highly restricted case of jump discontinuities between piecewise
constant regions. While these approaches are not guaranteed to yield
the true global optimum, they typically produce good local optima.
A possibility for finding the global optimum is the annealing-based
technique for finding discontinuities described by [Geman and Geman,
1984]. This approach, while computationally expensive on sequential
machines, nonetheless appears to be effective [Marroquin, 1984].

Optimality criteria for the parametric functions should be made
part of the energy functional $\mathcal{E}(v)$. A necessary criterion restricts
the number of discontinuities placed, and other potentially useful
criteria for placement of discontinuities may include a preference for
piecewise smooth discontinuity contours. For the two-dimensional
case of surfaces, a natural formulation for the optimal placement
of discontinuities may reduce to a one-dimensional ill-posed
reconstruction problem. In such a case, analogous controlled-smoothness
regularization can be applied in a single dimension,
yielding a one-dimensional spline problem.

5.  Conclusion 

Classical  regularization  methods  can  be  deficient  in  application 
to  ill-posed  visual  reconstruction  problems  involving  discontinuities. 
To  remedy  the  situation,  the  smoothness  properties  of  standard  spline 
stabilizing  functionals  must  be  relaxed  locally  at  discontinuities. 

Exploring the close connection of regularization analysis and
generalized spline approximation, we were led to interpret the
natural multidimensional generalizations of Tikhonov's stabilizers as
controlled-smoothness splines. The smoothness properties of these
combinations of generalized spline functionals can be regulated locally
through a set of parametric functions. The controlled-smoothness
spline functionals encompass the special cases of classical splines
under tension, as well as thin plate surfaces under tension. The
latter were developed as physical models in our recent work
on visible-surface reconstruction, a fundamentally ill-posed visual
problem. Controlled-smoothness regularizers suggest possibilities for
incorporating discontinuity detection into regularization analysis. This
is a challenging problem that is currently under investigation.

Abstract

A method is presented for transforming a set of line drawings
of a polyhedral scene into a representation that embodies
the three-dimensional structure of the scene. The line drawings
are first converted to machine-readable form and then back-projected
to acquire a wire frame skeleton of the scene. A
novel three-dimensional constraint propagation scheme is then
employed to transform the wire frame to a description of the solid
objects which compose the scene. This process has applications
in computer-aided design as well as in machine understanding of
multiple images. The paper concludes with a discussion of issues
related to achieving the same result from a single view.

1.  Introduction 

Machines that must reason about or function in a three-dimensional
world must be equipped with models of objects in
that world. A multitude of representations has been devised
for three-dimensional models [1], yet the specification of individual
models can be a tedious undertaking. This paper examines
methods for computing a three-dimensional model of a particular
class of objects from a particular form of input: polyhedral
objects from line drawings.

Researchers in computer-aided design have produced numerous
systems that manipulate models of solid objects to assist in
the design, analysis, or fabrication of everything from machine
parts to factories. The act of specifying a model is one of the
most difficult tasks associated with these systems.

In an interactive image-understanding system, there are several
sources of line drawings. One can envision a very competent
line-finder that automatically extracts the line drawings of selected
objects. Alternatively, the user can specify the lines in an
image by pointing at their endpoints with a mouse or other input
device. A third possibility is for the user to draw the figures
freehand or with mechanical assistance. Whatever the means of
entry, the objective is to produce a three-dimensional sketch that
captures the volumetric nature of the objects.

This paper is concerned with deriving the representation geometrically,
as opposed to using model-based representations. It
is divided into two parts: The first presents an algorithm for deriving
a volumetric description of a polyhedral scene when multiple
views are available; the second part explores the problem of
accomplishing this task when only one line drawing is available.

2.  Multiple  Views 

The algorithm to be described solves for a three-dimensional
description of a scene when several views are available. The overall
process can be thought of as accepting a set of line drawings of
a scene as input and providing as output a display of the object
from any angle, with all hidden lines removed. The algorithm
consists of four sequential modules called input, projection, wire
frame, and display.

The  required  input  is  a  set  of  two  or  more  line  drawings 
and  the  angular  relationships  among  them.  A  line  drawing  is 
restricted  to  be  the  projection  (orthographic  or  perspective)  of  a 
polyhedron  from  a  particular  vantage  point  and,  as  a  result,  is  a 
collection  of  straight  line  segments. 

The input module is responsible for producing a data structure
that specifies the positions of all lines and their endpoints
in a line drawing. Its actual form will vary with the source of
the line drawing, as different input processes dictate different
procedures for constructing the data structure.

The projection module computes the three-dimensional coordinates
of vertices and edges that may have given rise to the
endpoints and lines in the drawings. The output of the projection
module is in the form of a three-dimensional wire frame, which
is represented as a list of vertices and edges. The computation
is carried out by back-projecting the points in each line drawing
and determining their points of intersection.

Next in the pipeline is the wire frame module, which is the
most interesting of the four. Its task is to derive the solid object
that corresponds to the given wire frame. It employs a
Waltz-style constraint propagation scheme [8], but differs significantly
by assigning labels to spatial regions and propagating
them throughout the three-dimensional structure, in contrast
with propagation across a two-dimensional line drawing. Only
two labels are allowed (SOLID and HOLE), and a consistent labeling
is usually achieved very quickly.

The  display  module  uses  the  labeled  output  of  the  wire  frame 
module  to  produce  a  display  of  the  object,  with  hidden  lines 
removed.  As  will  be  seen  shortly,  the  hidden-line  algorithm  is 
somewhat  unusual  in  the  way  it  takes  advantage  of  the  label 
information  in  the  wire  frame. 


Figure 1: Photographs of the Transamerica Building


Figure 2: Line Drawings of the Transamerica Building


2.1.  The  Input  Module 

The input module is actually a set of alternative modules.
The one to be used depends on the source of the line drawing. In
all cases, a line drawing is defined to be composed of a set of
line segments, to be referred to as lines, and their points of intersection,
to be referred to as endpoints. Lines are restricted to
intersect only at their endpoints. No curves are permitted since
the line drawing is assumed to be the projection of a polyhedron.
Furthermore, all edges of an object, visible or not, must appear
in all line drawings. That is, hidden lines, which are normally
represented as dashed lines in engineering drawings, are to be
depicted like any other line, since the algorithm does not accord
them any special treatment.

The  most  direct  means  for  specifying  the  line  drawing  would 
be  to  provide  the  coordinates  of  the  endpoints  explicitly.  Another 
procedure  is  envisioned  that  allows  specification  of  line  drawings 
in  a  more  natural  way. 

In one scenario, the user will sketch or trace a line drawing
directly into the system by means of a pointing device, such as
a mouse or graphics tablet. Inaccuracies inherent in this procedure
must be resolved before the data are passed to the projection
module. Sugihara presents a method for identifying incorrect line
drawings and correcting vertex position errors [7]. Regardless of
the method used, the input module provides a data base containing
the endpoints, lines, and viewpoint for each line drawing.
As an example, Figure 1 shows two images of the Transamerica
Building in San Francisco. The line drawings of Figure 2
were obtained by tracing the edges of the building with a mouse-controlled
cursor. The camera models of each image were computed
on the basis of ground truth data obtained from a map of
San Francisco.

2.2.  The  Projection  Module 

Given  the  data  base  specifying  a  set  of  line  drawings  of  a 
scene  and  the  associated  camera  models,  the  projection  module 
determines  the  wire  frame  of  the  scene  that  could  have  given  rise 
to  those  drawings.  A  wire  frame  is  a  set  of  vertices  and  edges, 
where  an  edge  is  the  intersection  of  two  faces  of  an  object,  and 
a  vertex  is  the  intersection  of  two  edges.  The  algorithm  follows 
closely that of Wesley and Markowsky [9].

In  the  first  phase,  the  vertices  of  the  wire  frame  are  com¬ 
puted.  Figure  3  illustrates  the  geometry  involved.  .Any  vertex 
of  the  wire  frame  is  constrained  to  lie  on  the  line  connecting  the 
viewpoint  and  the  endpoint  that  is  the  projection  of  the  vertex  in 
the  line  drawing.  Such  lines  are  constructed  for  every  endpoint 
in every line drawing. The intersections of these lines are computed
and entered as vertices of the wire frame. Note that this
procedure may find vertices that are not in fact vertices of the
wire frame. Some of these vertices are identified in later stages of
processing and discarded; the remainder are the result of alternative
legal interpretations of the line drawings. It is also possible
that some real vertices may be missed, but, fortunately, they
can be found during the second phase of the projection module's
operation.

Figure 3: The geometry of backward projection.
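
The vertex computation of the first phase can be illustrated with a short sketch. It assumes each back-projected line is given by a viewpoint and a direction toward an endpoint, and it accepts a vertex when two such lines pass within a tolerance of each other. The function name `intersect_lines`, the tolerance `tol`, and the midpoint rule for near misses are assumptions made for illustration; the text does not say how numerical near misses are treated.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def _scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def intersect_lines(p1, d1, p2, d2, tol=1e-6):
    """Return the point where the 3-D lines p1 + t*d1 and p2 + s*d2 meet.

    Returns None when the lines are parallel or miss each other by more
    than `tol` (hypothetical tolerance, not taken from the paper).
    """
    w = _sub(p1, p2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:          # parallel lines: no single intersection
        return None
    t1 = (b * e - c * d) / denom    # parameter of the closest point on line 1
    t2 = (a * e - b * d) / denom    # parameter of the closest point on line 2
    q1 = _add(p1, _scale(d1, t1))
    q2 = _add(p2, _scale(d2, t2))
    gap = _sub(q1, q2)
    if _dot(gap, gap) > tol * tol:  # the lines pass each other without meeting
        return None
    return _scale(_add(q1, q2), 0.5)
```

Calling this for every pair of lines drawn from different views yields the candidate vertices; as the text notes, some of these candidates are spurious and are discarded later.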

In Phase two, the edges of the wire frame are found. Two
vertices are connected by an edge only if that edge is consistent
with all views provided. An edge is consistent with a view if that
edge projects to a line in the view, to a set of continuous collinear
lines, or to a single endpoint. When all such edges have been
found, the edges are checked for internal intersections. Any such
intersections are the missing vertices of Phase one and are added
to the data base. Just as extra vertices may have been found
earlier, extra edges may arise for the same reasons.

In  Phase  three,  any  vertex  with  fewer  than  three  incident 
edges is eliminated. (Realizable solid objects always have at least
three  edges  meeting  at  any  vertex.)  Any  accompanying  edges  are 
also  removed  and  the  pruning  is  continued  until  a  stable  config¬ 
uration  is  reached.  Usually,  however,  there  are  no  vertices  to  be 
removed  in  this  manner. 
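
The pruning of Phase three amounts to repeatedly deleting low-degree vertices until nothing changes. A minimal sketch follows, assuming the wire frame is held as a set of vertex identifiers and a set of two-element edges; this representation and the name `prune_wire_frame` are illustrative choices, not the paper's data structure.

```python
def prune_wire_frame(vertices, edges):
    """Remove vertices with fewer than three incident edges, with their edges.

    vertices: iterable of hashable vertex ids.
    edges: iterable of (u, v) pairs whose ends are all in `vertices`.
    Returns (vertices, edges) once a stable configuration is reached.
    """
    vertices = set(vertices)
    edges = {frozenset(e) for e in edges}
    while True:
        degree = {v: 0 for v in vertices}
        for edge in edges:
            for v in edge:
                degree[v] += 1
        weak = {v for v, count in degree.items() if count < 3}
        if not weak:
            return vertices, edges
        vertices -= weak
        # Drop every edge that touches a removed vertex.
        edges = {edge for edge in edges if not (edge & weak)}
```

Removing one vertex can lower the degree of its neighbors, which is why the loop runs until no vertex is removed.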

At  this  point,  a  wire  frame  has  been  computed  that  is  guar¬ 
anteed  to  encompass  all  the  edges  and  vertices  of  the  object.  If 
the  set  of  line  drawings  provided  determines  the  object  uniquely, 
the  wire  frame  will  correspond  exactly  to  the  wire  frame  of  the 
object.  If  the  line  drawings  are  ambiguous,  the  wire  frame  may 
contain  vertices  and  edges  present  in  one  interpretation  but  ab¬ 
sent  in  others.  The  ambiguous  case  can  be  accommodated  by 
invoking  the  wire  frame  module  for  each  of  the  possible  inter¬ 
pretations.  Those  found  to  be  inconsistent  can  be  disregarded; 
those  found  to  have  a  legal  interpretation  can  be  construed  as 
alternative  solutions.  The  remainder  of  this  paper  assumes  that 
the  wire  frame  has  been  determined  uniquely. 

2.3.  The  Wire  Frame  Module 

The  input  to  the  wire  frame  module  is  a  data  structure  rep¬ 
resenting  a  wire  frame  that  contains  only  edges  and  vertices  that 
correspond  to  true  edges  and  vertices  of  the  underlying  scene. 
The module's job is to find out which regions are occupied by
solid matter and which are not, relative to the wire frame. Its
basic tool for performing this reasoning is the spoke diagram
(Figure 4). The spoke diagram is an edge-on view of a vertex.
The spokes are the projections of the edges at a vertex onto the
plane that is perpendicular to the selected edge. The spoke diagram
in the figure is the view along edge E1 toward vertex V1,
such that E1 itself projects out of the drawing. The sector between
two spokes represents the solid angle defined by the two
edges corresponding to the two spokes and the selected edge.
The solid angle must either be filled completely with matter or
be completely void of matter, because boundaries between matter
and space can occur only at faces and all faces are bounded
by edges. Therefore, each sector can be labeled either SOLID
or HOLE to reflect this choice. The task of the wire frame module
is then to assign a label of SOLID or HOLE to every such solid
angle, as defined by the wire frame.

Figure 5: Propagation of Intravertex Constraints.
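
The construction of a spoke diagram can be sketched as a projection onto the plane perpendicular to the selected edge. The code below assumes edges are given by the coordinates of their far endpoints; the function name `spoke_angles` and the choice of reference direction in the plane are arbitrary illustrative choices, since only the angular order and spacing of the spokes matter.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def spoke_angles(vertex, selected_end, other_ends):
    """Angle of each spoke, in radians in [0, 2*pi), as seen looking along
    the selected edge toward the vertex (the edge points at the viewer).

    other_ends maps an edge name to the far endpoint of that edge.
    An edge parallel to the selected edge has no defined spoke.
    """
    axis = _unit(_sub(selected_end, vertex))
    # Any vector not parallel to the axis serves to build the plane basis.
    helper = (1.0, 0.0, 0.0) if abs(axis[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = _unit(_cross(axis, helper))
    v = _cross(axis, u)             # u x v == axis, so angles run counterclockwise
    angles = {}
    for name, end in other_ends.items():
        d = _sub(end, vertex)
        angles[name] = math.atan2(_dot(d, v), _dot(d, u)) % (2.0 * math.pi)
    return angles
```

Sorting the returned angles gives the cyclic order of the spokes, and each consecutive pair bounds one sector to be labeled SOLID or HOLE.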

As  mentioned  earlier,  the  wire  frame  module  is  a  constraint 
propagation  algorithm.  Three  separate  processes  serve  to  con¬ 
strain  and  propagate  the  labelings: 

1. Intravertex constraints — These serve to propagate labels
(SOLID or HOLE) among the spoke diagrams at a
given vertex, as in the example of Figure 5. Here an assignment
of SOLID to the sector between the spokes corresponding
to edges E2 and E3, in the spoke diagram of E1 at
vertex V1, can be propagated to constrain the other spoke
diagrams at V1. In the spoke diagram of edge E2, the sector
between edges E1 and E3 must also be labeled SOLID.
The same applies to edge E3. Corresponding sectors must
be labeled identically because they are representations of
the same volume in space. Care must be taken to account
for sectors wider than 180°.

Figure 6: Propagation of Intervertex Constraints.

2.  Intervertex  constraints — These  serve  to  propagate  la¬ 
bels  from  one  vertex  to  another,  as  in  Figure  6.  Here  the 
newly assigned label of SOLID to the sector between E2
and E3, in the spoke diagram of edge E1 at vertex V1, can
be propagated along E1 to constrain the spoke diagrams
at V2. In this case the sector between E4 and E5 must
also be SOLID. The two pairs of edges define a dihedral
angle along E1. This type of constraint propagation is always
valid because the state can change between SOLID
and HOLE only at a face. If a face occurred somewhere
along E1, it would intersect E1 at a point other than
one of its endpoints, which is precluded by the definition
of a wire frame.

3.  Vertex  Labeling  Constraints — Some  potential  label¬ 
ings  of  a  spoke  diagram  are  not  legal.  It  is  conceivable  to 
determine  the  set  of  legal  labelings  for  trihedral  vertices, 
tetrahedral  vertices,  etc.,  but  in  practice  this  becomes  un¬ 
wieldy.  As  implemented,  the  wire  frame  module  applies 
a  somewhat  weaker  constraint.  It  prohibits  any  labeling 
that  is  all  SOLIDs  or  all  HOLEs.  Since  such  an  assignment 
would  render  the  edge  nonexistent,  it  could  not  be  a  legal 
interpretation  of  the  wire  frame.  In  practice,  this  rather 
weak  constraint  has  generally  proved  satisfactory. 

Unfortunately,  there  are  numerous  special  cases  that  arise 
during  execution.  The  intravertex  constraints  must  distinguish 
between angles greater or less than 180° to ensure proper labeling.
The intervertex constraints must be applied properly when
the  spoke  diagrams  at  each  endpoint  do  not  coincide.  Extra 
spokes  and  missing  spokes  are  two  such  cases. 

The  wire  frame  module  is  a  control  structure  for  propagating 
these  three  constraints  throughout  a  wire  frame.  For  efficiency, 
the  implementation  actually  applies  the  three  constraints  simul¬ 
taneously.  It  includes  checks  for  completion  and  inconsistencies. 
Although a given wire frame may be ambiguous (i.e., allow more
than one interpretation), the algorithm is guaranteed to terminate.
Each step assigns a label, does nothing, or detects an
inconsistency and quits. Once a label is assigned, it is never removed.
The algorithm proceeds until all sectors have labels or no
change has been detected through a complete iteration. If some
sectors remain unlabeled, the algorithm is continued separately
for each possible labeling (SOLID or HOLE) until a completely
labeled wire frame is obtained.

Figure 7: Hidden-line removal.
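
The control structure described above can be sketched as a fixed-point loop followed by a case split. The sketch assumes an abstract encoding that is not in the paper: `same` lists pairs of sectors that represent the same volume (the intravertex and intervertex correspondences), and `diagrams` lists the sectors of each spoke diagram, to which the weak rule "not all SOLID and not all HOLE" is applied. All names are hypothetical.

```python
SOLID, HOLE = "SOLID", "HOLE"

class Inconsistent(Exception):
    """Raised when two constraints demand different labels for one sector."""

def propagate(labels, same, diagrams):
    """Extend `labels` (sector -> SOLID/HOLE) until no rule adds a label."""
    labels = dict(labels)

    def assign(sector, value):
        old = labels.get(sector)
        if old is None:
            labels[sector] = value   # labels are only added, never removed
            return True
        if old != value:
            raise Inconsistent(sector)
        return False

    changed = True
    while changed:
        changed = False
        for a, b in same:            # corresponding sectors share one label
            if a in labels:
                changed |= assign(b, labels[a])
            if b in labels:
                changed |= assign(a, labels[b])
        for sectors in diagrams:     # a diagram may not be uniformly labeled
            known = {labels[s] for s in sectors if s in labels}
            unknown = [s for s in sectors if s not in labels]
            if len(known) == 1:
                if not unknown:
                    raise Inconsistent(tuple(sectors))
                if len(unknown) == 1:
                    only = next(iter(known))
                    changed |= assign(unknown[0], HOLE if only == SOLID else SOLID)
    return labels

def interpretations(labels, same, diagrams, sectors):
    """Return every complete, consistent labeling reachable from `labels`."""
    try:
        labels = propagate(labels, same, diagrams)
    except Inconsistent:
        return []
    free = [s for s in sectors if s not in labels]
    if not free:
        return [labels]
    results = []
    for guess in (SOLID, HOLE):      # continue separately for each labeling
        results += interpretations({**labels, free[0]: guess}, same, diagrams, sectors)
    return results
```

Termination follows the argument in the text: every pass either adds at least one label to a finite set of sectors or changes nothing and stops.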

2.4.  The  Display  Module 

The  display  module  uses  the  labels  computed  by  the  wire 
frame  module  to  eliminate  lines  hidden  from  view.  Most  of  the 
lines  to  be  eliminated  can  be  readily  identified  by  the  process 
illustrated  in  Figure  7.  For  each  edge,  the  direction  to  the  view¬ 
point  is  computed  and  that  ray  is  superimposed  on  the  spoke  dia¬ 
gram.  If  the  ray  pierces  a  SOLID  sector,  that  edge  is  a  rearward¬ 
facing  edge  and  is  not  displayed.  If  the  ray  falls  into  a  HOLE 
sector,  further  processing  is  needed.  The  projection  of  the  edge 
is  checked  for  intersection  with  the  projection  of  all  other  visible 
edges.  If  no  such  intersection  exists,  the  edge  is  displayed.  If  an 
intersection  is  found,  the  intersecting  edge  is  examined  to  deter¬ 
mine  which  side  of  the  edge  is  occluded  (or  if  both  are).  The  edge 
is  split  at  its  point  of  intersection  and  each  part  is  handled  in  the 
same  manner  recursively.  It  should  be  noted  that  this  algorithm 
will  fail  to  eliminate  certain  occluded  edges  if  their  projections 
do  not  intersect  with  any  other  projected  edges.  A  less  efficient 
search  would  be  required  to  eliminate  these. 
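
The first test of the display module, locating the sector pierced by the viewing ray, can be sketched as a lookup in the sorted spoke angles. The representation is assumed for illustration: `spokes` holds the spoke angles in ascending order, and `sector_labels[i]` labels the sector running counterclockwise from spoke `i` to the next spoke, wrapping around at the end.

```python
import bisect
import math

def sector_label(spokes, sector_labels, ray_angle):
    """Label of the sector of a spoke diagram that contains `ray_angle`.

    spokes: spoke angles in radians, sorted ascending, each in [0, 2*pi).
    sector_labels[i]: label of the sector from spokes[i] to the next spoke.
    """
    angle = ray_angle % (2.0 * math.pi)
    i = bisect.bisect_right(spokes, angle) - 1
    # i == -1 means the ray lies before the first spoke, i.e. in the
    # wrap-around sector that starts at the last spoke.
    return sector_labels[i]

def is_rearward_facing(spokes, sector_labels, ray_angle):
    """True when the ray toward the viewpoint pierces a SOLID sector."""
    return sector_label(spokes, sector_labels, ray_angle) == "SOLID"
```

An edge for which `is_rearward_facing` returns False is only a candidate for display; the intersection tests described in the text still apply to it.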

Returning  to  the  example  of  the  Transamerica  Building  of 
Figure  1,  a  wire  frame  was  obtained  from  the  line  drawings  and 
was  subsequently  processed  by  the  wire  frame  module.  Figure  8 
shows  a  perspective  view  of  the  result,  with  hidden  lines  removed, 
indicating  realization  of  the  correct  wire  frame  and  the  successful 
assignment of solid material relative to it.

2.5.  Summary 

The  class  of  objects  that  may  be  modeled  is  not  as  restrictive 
as  it  may  seem  at  first  glance.  Any  polyhedron  is  permissible 
and  may  contain  arbitrary  concavities  and  holes.  Curves  may  be 
approximated by a number of line segments. Several polyhedra
may  be  juxtaposed  in  any  manner. 

The  line  drawings  may  be  either  orthographic  or  perspective, 
and from any vantage point. Accidental alignments pose no particular
problem. A unique wire frame will be constructed if any
drawing is in general position (i.e., no vertices or edges coincide
in the projection). Reconstruction of a unique wire frame is also
guaranteed if corresponding lines in each view are designated.

Figure 8: Hidden-line removal on the Transamerica Building.

Just  as  a  set  of  line  drawings  may  define  more  than  one  wire 
frame,  some  wire  frames  may  define  more  than  one  object.  The 
wire  frame  module  will  derive  all  possible  interpretations  when 
confronted  with  an  ambiguous  wire  frame. 

Stronger  vertex-labeling  constraints  may  be  necessary  to  as¬ 
sure  the  correct  interpretation  of  some  objects.  Further  testing 
is  required  to  refine  the  constraints  and  demonstrate  the  ability 
to  derive  volumetric  descriptions  of  a  wide  range  of  polyhedral 
scenes. 


3.  Single  View 

We  have  described  a  procedure  that  makes  it  possible  to 
infer the shape of an object from several views. A means for accomplishing
this task from a single view is also desirable. After
all,  humans  appear  to  have  little  trouble  constructing  a  three- 
dimensional  representation  of  an  arbitrary  object  from  a  single 
line  drawing.  While  this  section  does  not  offer  a  method  for  ap¬ 
proximating  human  performance,  it  does  point  to  some  promising 
approaches  toward  that  objective. 

The  most  relevant  early  work  on  polyhedral-scene  interpre¬ 
tation  is  by  Mackworth  [0].  His  program,  POLY,  has  provided 
a  framework  for  subsequent  shape-from-line-drawing  methods. 
POLY  first  achieves  a  qualitative  interpretation  of  a  line  drawing 
by  parsing  the  line  segments  to  ascertain  which  of  them  are  con¬ 
vex,  concave,  or  occluding.  The  convex  and  concave  edges  yield 
constraints that can often be used to determine the orientation
of  the  polyhedral  faces  quantitatively.  By  using  the  now  famil¬ 
iar  technique,  the  orientations  of  faces  that  meet  at  an  edge  are 
restricted  to  lie  on  a  line  in  gradient  space  that  is  perpendicu¬ 
lar  to  the  image  line  of  that  edge.  Combining  these  constraints 
through  triangulation  often  yields  the  orientations  of  all  the  faces. 
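
The perpendicularity constraint can be written out for the orthographic case. The plane equations and the symbols $(p_i, q_i)$ for the face gradients follow the usual gradient-space convention and are assumed here rather than taken from the text.

```latex
\begin{align*}
\text{two faces:}\quad & z = p_1 x + q_1 y + c_1, \qquad z = p_2 x + q_2 y + c_2 \\
\text{image of their common edge:}\quad & (p_1 - p_2)\,x + (q_1 - q_2)\,y + (c_1 - c_2) = 0 \\
\text{for any direction } (\Delta x, \Delta y) \text{ along that line:}\quad & (p_1 - p_2)\,\Delta x + (q_1 - q_2)\,\Delta y = 0
\end{align*}
```

The vector joining the two gradients, $(p_1 - p_2,\; q_1 - q_2)$, is therefore normal to the image line of the edge. Once the gradient of one face is fixed, the gradient of its neighbor is confined to a line perpendicular to the image of their shared edge.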


Figure  9:  An  ambiguous  drawing. 


Figure 10: Another ambiguous drawing.

Unfortunately,  POLY  has  many  shortcomings  that  limit  its  com¬ 
petence,  as  pointed  out  by  Draper  [3].  He  presents  methods  for 
overcoming  some  of  POLY’s  limitations,  yet  human-level  perfor¬ 
mance  is  still  beyond  reach. 

One  observation  that  dampens  hope  of  ever  producing  an 
algorithmic  solution  to  the  problem  is  that  human  perception  of 
line  drawings  can  be  extremely  subjective.  Figure  9(a)  shows  a 
triangular  pyramid  that  can  be  variously  interpreted  as  either 
very flat (b), or very pointed and elongated (c). In gradient
space, this phenomenon manifests itself in the fact that the scale
of the gradient space cannot be resolved from a line drawing
algebraically without higher-level assumptions. In fact, neither
the scale nor the origin of gradient space can be determined by
existing methods that utilize the gradient-space representation.

Recent  work  by  Barnard  [2]  addresses  this  issue  of  subjec¬ 
tive  interpretation.  For  instance,  if  one  is  willing  to  assume  that 
the pyramid in Figure 9 is actually composed of mutually perpendicular
faces, that information can be used to specify the
orientations  of  all  faces  of  the  object  uniquely.  Moreover,  what 
this  accomplishes  is  to  establish  both  the  origin  and  scale  of  gra¬ 
dient  space.  Barnard's  procedure  only  requires  identification  of 
three  lines  in  the  scene  known  to  be  orthogonal  (they  need  not 
intersect)  in  order  to  compute  these  quantities.  This  represents  a 
powerful  tool  when  coupled  with  the  purely  objective  approach  to 
the  interpretation  of  line  drawings.  The  requirement  for  finding 
a set of mutually orthogonal lines is no great burden when many
scenes of man-made objects are being examined, but one would
prefer a purely general method for the subjective interpretation
of line drawings. Heuristics exploiting parallelism, symmetry,
and compactness of description may provide a useful inroad.

The  determination  of  the  origin  and  scale  of  gradient  space 
is  in  itself  not  sufficient  for  the  interpretation  of  all  faces  in  a 
polyhedral  scene.  Figure  10  shows  an  object  and  its  gradient- 
space  interpretation.  Even  if  the  orientation  of  face  A  is  known 
exactly,  the  locations  in  gradient  space  of  faces  B,  C,  and  D  are 
still  underdetermined.  The  figure  is  analytically  ambiguous  but 
subjectively  resolvable;  additional  heuristics  may  be  necessary  to 
find  the  solution. 

Another  issue  inherent  in  the  analysis  of  line  drawings  is 
how  best  to  cope  with  imprecise  input.  Line  drawings  may  be 
extracted  from  real  images  or  may  be  hand-drawn.  One  would 
prefer  an  algorithm  that  does  not  degenerate  completely  when 
confronted with inaccurate drawings. A drawing of an "impossible"
object, that is, a drawing that does not correspond to any
geometrically possible object, should be interpreted as the "closest"
object that is geometrically permissible. Kanade's algorithm
[4], which works through iterative minimization of errors, provides
a framework for achieving this goal. One can conceive of
designing  a  system  that  theoretically  supports  only  orthographic 
line  drawings,  and  using  it  to  interpret  perspective  drawings.  If 
the  focal  length  is  sufficiently  large,  the  perspective  distortion 
might  be  treated  as  drawing  error  and  an  approximate  interpre¬ 
tation  obtained.  While  the  validity  of  this  approach  depends  on 
the  application  in  mind,  it  does  circumvent  the  difficulties  of  a 
truly  perspective  model. 

The  gradient-space  representation  is  unsuitable  for  analyzing 
perspective  drawings  [f>|.  The  primary  reason  is  the  inability  to 
capture  the  concept  of  sidedness  of  a  plane  in  gradient  space.  Sid¬ 
edness  reasoning  is  essential  to  the  interpretation  of  perspective 
drawings  because  either  side  of  a  plane  may  be  visible,  depend¬ 
ing  on  the  plane's  location  in  a  perspective  drawing.  Formalisms 
based on the Gaussian sphere overcome this problem. The mathematics
becomes a little more complex (quadratic versus linear
equations), but the two solutions to each quadratic equation, corresponding
to the two sides of a plane in three-dimensional space,
enable quantitative analysis of perspective scenes.

4.  Summary 

Recovering  the  shape  of  an  object  from  a  single  line  drawing 
of  that  object  is  a  difficult  problem.  Further  investigation  is 
necessary  to  achieve  human-level  competence. 

The  algorithm  presented  for  interpreting  scenes  from  multi¬ 
ple  views  embodies  a  novel  approach  to  a  long-standing  problem. 
The  type  of  spatial  reasoning  used  promises  to  be  applicable  in 
other  situations  as  well.  The  technique  may  be  successful  when 
only  a  portion  of  an  object  is  visible  and  may  perform  adequately 
even  with  inaccurate  line  drawings  (such  as  those  missing  a  line 
here  or  there).  It  is  a  local  reasoning  process  that  may  be  es¬ 
pecially  appropriate  for  supporting  higher-level  reasoning  about 
solid objects.

ABSTRACT

Matching  successive  frames  of  a  dynamic  image  se¬ 
quence  using  area  correlation  has  been  studied  for  many 
years  by  researchers  in  machine  vision.  Most  of  these  ef¬ 
forts  have  gone  into  improving  the  speed  and  the  accuracy 
of  correlation  matching  algorithms.  Yet,  the  displacement 
fields  produced  by  these  algorithms  are  often  incorrect  in 
homogeneous  areas  of  the  image  and  in  areas  which  are  visi¬ 
ble  in  one  frame,  but  are  occluded  in  the  succeeding  frames. 
Further,  these  displacement  fields  are  often  incorrect  even 
at  non-occluded  areas  that  border  occlusion  boundaries.  In 
this  paper,  we  present  a  confidence  measure  which  indicates 
the  reliability  of  each  displacement  vector  computed  by  a 
specific  hierarchical  correlation  matching  algorithm.  We 
also  provide  an  improved  hierarchical  matching  algorithm 
which  performs  particularly  well  near  occlusion  boundaries. 
We  demonstrate  these  with  experiments  performed  on  real 
image sequences taken in our robotics laboratory. A more
detailed version of this work appears in [Anan84].

1.  INTRODUCTION 

One  of  the  powerful  techniques  that  have  been  studied  by 
researchers  in  image  processing  and  computer  vision  for  the 
purpose  of  matching  images  is  area  correlation  [Agga81a, 
Bard80, Hann74, Mora81, Genn80, Burt83, Glaz83, Wong78,
Lawt84]. Much of this work has addressed issues in choosing
a useful match measure, increasing the accuracy of the
match,  and  in  reducing  the  computational  complexity  of 
the  matching  algorithms.  However,  most  of  the  current 
techniques  produce  false  matches  when  applied  to  scenes 
containing  occlusion,  i.e.,  where  a  certain  area  of  the  im¬ 
age  which  is  visible  in  one  frame  is  hidden  by  other  moving 
areas in the succeeding frames. The problems include the
following: 

•  The  search  required  may  be  large. 

•  The spatial resolution of the displacement field produced
by correlation matching becomes poorer as the
size of the sample window increases [Genn80].

•  Correlation matching tends to produce false matches
in areas of the image where the variation is low.

•  Correlation  matching  fails  most  miserably  in  areas  of 
the  image  that  are  occluded  and  the  non-occluded  ar¬ 
eas  that  border  them. 

Some researchers have used hierarchical search strategies
[Glaz83, Burt83, Wong78] to reduce the amount of
search required. However, the problems in processing scenes
containing occlusion still remain. In this study, we use
the hierarchical matching algorithm of Glazer, Reynolds,
and Anandan [Glaz83] as our basis. We isolate the situations
where this matching algorithm fails by computing a
confidence  measure  which  estimates  the  reliability  of  each 
displacement  vector  that  is  computed  by  the  matching  al¬ 
gorithm.  We  then  modify  the  hierarchical  search  strategy 
to  improve  the  results,  especially  near  occlusion  boundaries. 
The  result  is  a  computationally  efficient  matching  algorithm 
which  provides  a  dense  displacement  field  with  estimates  of 
reliability  of  each  displacement  vector. 

Section  2  of  this  paper  describes  the  various  types  of 
correlation  techniques  that  have  been  investigated  by  re¬ 
searchers.  Section  3  describes  some  of  the  work  done  by 
other  researchers  for  finding  a  confidence  measure,  and  de¬ 
scribes  the  measure  chosen  for  this  work.  Section  4  de¬ 
scribes  our  modifications  to  the  search  strategy.  Section  5 
describes  some  applications  and  the  possible  future  direc¬ 
tions  of  this  research. 

2.  TYPES  OF  CORRELATION  MEASURES 
AND  ALGORITHMS 

A  variety  of  correlation  measures  and  associated  search 
strategies  have  been  studied  by  researchers  in  the  field  of 
image-matching.  In  this  section  we  briefly  review  some  of 
these  measures  and  some  of  these  search  strategies.  The 
particular  choices  of  the  measure  and  the  search  strategy 
are not always independent of each other. In our discussion
below,  we  point  out  such  dependencies  where  they  occur. 

2.1  Types  of  Correlation 

Some typical correlation match measures that are used
by researchers are described in [Hann74]. These include
direct correlation, mean-normalized correlation, variance-normalized
correlation, sum of the squares of the differences
between corresponding pixels (SSD), and sum of the
magnitudes of the differences.
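
For orientation, the measures listed above can be sketched in their common textbook forms over two equally sized windows flattened to lists of pixel values. These forms are assumptions for illustration and may differ in detail from the definitions in [Hann74]; in particular, `normalized_correlation` below applies both mean and variance normalization at once.

```python
import math

def _mean(w):
    return sum(w) / len(w)

def direct_correlation(a, b):
    """Sum of products of corresponding pixels (larger means more similar)."""
    return sum(x * y for x, y in zip(a, b))

def mean_normalized_correlation(a, b):
    """Correlation after subtracting each window's mean."""
    ma, mb = _mean(a), _mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def normalized_correlation(a, b):
    """Mean- and variance-normalized correlation, in [-1, 1]."""
    ma, mb = _mean(a), _mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0   # flat windows carry no structure

def ssd(a, b):
    """Sum of squared differences (smaller means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sad(a, b):
    """Sum of the magnitudes of the differences (smaller means more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

With `a = [1, 2, 3, 4]` and a brightened copy `b = [11, 12, 13, 14]`, the fully normalized correlation still reports a perfect match while the SSD is large, which illustrates the sensitivity of SSD to a difference of means.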

A  comparative  study  of  direct,  mean-normalized  and 
variance-normalized  correlation  measures  can  be  found  in 
[Burt82]. In addition, Burt suggests the use of Laplacian-filtered
images for matching. Although this filtering process
can  be  used  in  combination  with  any  of  the  above  match 
measures,  his  study  includes  only  the  Laplacian-filtered  di¬ 
rect  correlation.  Burt  shows  that  the  most  reliable  results 
are  consistently  obtained  by  choosing  correlation  with  both 
mean  and  variance  normalizations.  However,  this  process 
(especially,  variance  normalization)  is  computationally  ex¬ 
pensive.  Therefore,  he  recommends  the  computationally 
efficient  Laplacian-filtered  direct  correlation  which  appears 
relatively insensitive to both mean and contrast changes
between  the  images,  although  this  measure  performs  poorly 
in  the  presence  of  high  frequency  noise. 

The  reason  for  the  success  of  the  Laplacian-filtered 
correlation  process  is  that  the  mean  value  of  a  Laplacian- 
filtered  image  tends  toward  zero  as  the  sample  window  size 
increases.  Thus,  the  filtering  process  has  the  effect  of  mean- 
normalizing  the  correlation  values.  However,  we  found  that 
when  the  window  sizes  are  smaller  than  8x8,  the  mean  of  a 
sample  window,  though  small,  was  not  nearly  zero.  In  such 
cases  using  Laplacian-filtered  SSD  provides  more  accurate 
results  than  Laplacian-filtered  correlation.  This  is  because 
the SSD measure is sensitive to the difference of the means
of the two areas that are compared. For smaller windows,
the differences of the means tend to be small, even though
the means are not zero. Our own experiments suggest (see
Table 1) that for a fixed size of the matching window, the
performance of Laplacian-filtered SSD exceeds that of the
Laplacian-filtered correlation using the same sized window.
Based on this, we have used Laplacian-filtered SSD as the
match measure for the rest of our study.

Table 1: Comparing Laplacian-filtered correlation with
Laplacian-filtered SSD.

The tests were conducted using the Mandrill images described
in [Glaz83]. Gaussian noise of standard deviation
0, 5, and 10 percent of the intensity range was added to
the second image, which was translated by (3,5). The table
entries are percentages of pixels with correct displacements.
Matching was based on square windows of width 3, 5, and
8 pixels.

Although  the  problem  of  mean  changes  can  be  elim¬ 
inated  by  using  Laplacian-filtered  SSD,  there  are  still  two 
sources  of  false  matches.  First,  contrast  variations  between 
the images degrade the performance of the SSD measure
[Hann74]. Second, the Laplacian-filtering process removes
image variations below a certain frequency, thus causing the
image structures to repeat beyond a distance corresponding
to the filter cut-off wavelength. Therefore, if the search
area is large, there is greater potential for the occurrence
of  false  matches.  In  the  following  section,  we  note  that  the 
search  strategy  used  can  be  helpful  in  alleviating  both  these 
difficulties. 

2.2  Search  Strategies 

There  are  only  a  limited  number  of  search  strategies 
that  have  been  employed  by  various  researchers.  The  most 
obvious  strategy  is  to  search  the  whole  area  within  the  ex¬ 
pected  maximum  displacement.  Hannah  [Hann74]  and  Gen- 
nery  [Genn80]  use  this  technique,  although  Gennery  men¬ 
tions  a  number  of  ways  to  cut  down  the  computational  cost 
of  the  search,  and  mentions  the  possibility  of  using  global 
techniques  to  get  approximate  estimates  which  are  then  im¬ 
proved  using  local  searches.  Lawton  [Lawt84]  uses  a  search 
technique that is most suitable for the case of pure translational
motion of the camera. In this case a global search is
performed  for  the  focus  of  expansion  (FOE),  which  is  the 
intersection  of  the  translational  axis  and  the  image  plane. 
Specific  values  of  the  FOE  are  evaluated  and  for  each,  the 
local  searches  for  optimal  feature  matches  are  constrained 
to  lie  along  radial  lines  emanating  from  the  assumed  FOE. 

Wong and Hall [Wong78], Glazer et al. [Glaz83], Burt
et al. [Burt83], and Moravec [Mora81] all use a multi-resolution
coarse-fine strategy, but there are important differences
among them. Among these differences, we are interested
in the fact that Wong and Hall and Moravec used
low-pass filtered images, whereas Glazer et al. and Burt
et al. used band-pass filtered images. Burt et al. used a
strategy where the searches at the different levels of resolution
operate independently of each other. This results in
low-frequency coarse resolution searches detecting large displacements
and higher-frequency finer resolution searches
detecting smaller displacements in the image. Glazer et al.
[Glaz83] use a strategy which utilizes the approximate estimate
given by the low-frequency, coarse resolution searches
as a starting value to define the search in the higher-frequency
images, thereby resulting in a more precise displacement
estimate.


For large displacements, the use of band-pass filtered
images and the coarse-fine strategy for matching is a natural
generalization of the Laplacian-filtered matching technique.
In this approach, at any given level of resolution the
band-pass filtered image corresponds to Laplacian-filtering
the low-pass filtered image that is faithfully representable
(according to the Nyquist criterion) at that resolution. The 3x3
search area used both by Burt et al. and Glazer et al. effectively
limits the search to less than half the wavelength
of the highest frequency information available at each level.
This restriction also helps reduce the effect of false matches
that may arise due to contrast variation between the images.
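
The coarse-fine idea, in which each level refines a doubled estimate from the level above by searching only one sample to either side, can be sketched in one dimension. This is a deliberately simplified illustration and not the algorithm of [Glaz83]: it uses 1-D signals instead of images, a pair-averaging (low-pass) pyramid instead of band-pass filtering, and a 3-sample search in place of the 3x3 search area. All function names are hypothetical.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]

def ssd_window(a, b, pos, shift, half):
    """SSD between the window of `a` centred at pos and the window of `b`
    centred at pos + shift; infinite if either window leaves its signal."""
    total = 0.0
    for k in range(-half, half + 1):
        i, j = pos + k, pos + k + shift
        if not (0 <= i < len(a) and 0 <= j < len(b)):
            return float("inf")
        total += (a[i] - b[j]) ** 2
    return total

def coarse_to_fine(a, b, pos, levels=3, half=2):
    """Estimate d such that a[pos] corresponds to b[pos + d].

    With `levels` levels and a +/-1 search per level, displacements of
    magnitude up to 2**levels - 1 can be reached.
    """
    pyramid_a, pyramid_b = [a], [b]
    for _ in range(levels - 1):
        pyramid_a.append(downsample(pyramid_a[-1]))
        pyramid_b.append(downsample(pyramid_b[-1]))
    estimate = 0
    for level in range(levels - 1, -1, -1):     # coarsest level first
        estimate *= 2                           # carry the estimate to the finer grid
        centre = pos >> level
        candidates = [estimate - 1, estimate, estimate + 1]
        estimate = min(
            candidates,
            key=lambda d: ssd_window(pyramid_a[level], pyramid_b[level], centre, d, half),
        )
    return estimate
```

Because every finer level searches only around the estimate it inherits, an error made at a coarse level cannot be undone later, which is the weakness near occlusion boundaries that the text goes on to discuss.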

The  band-pass  filtered,  coarse-fine  search  strategy  tends 
to introduce some problems of its own. At occlusion boundaries,
where there is a discontinuity in the displacement
field,  the  coarse-resolution  processing  errors  usually  occur 
because  1)  sampling  windows  will  overlap  across  the  bound¬ 
aries  and  2)  the  low-pass  filtering  process  smooths  the  im¬ 
age  across  the  boundaries.  Since  each  pixel  at  a  coarse  level 
tends  to  cover  a  large  area  at  the  finest  levels,  these  coarse- 
level  errors  tend  to  cause  incorrect  initial  estimates  to  be 
used  at  the  fine-level  pixels,  thus  leading  to  a  search  in  ar¬ 
eas  which  do  not  include  the  correct  match.  Typically  this 
creates  a  large  area  near  the  occlusion  boundary  with  in¬ 
correct  displacement  field.  Since  these  errors  are  primarily 
due  to  the  hierarchical  search  strategy,  it  may  be  possible 
to  eliminate  some  of  these  by  using  a  single  level  search 
strategy.  However,  we  believe  a  better  approach  would  be 
to maintain the coarse-fine search strategy, but try to recognize
such errors as they happen. Our confidence measure
is,  in  fact,  an  attempt  to  do  precisely  that. 

3.  A  CONFIDENCE  MEASURE 

As noted in the previous sections, most correlation matching
algorithms  generate  false  matches  in  homogeneous  areas, 
i.e.,  where  there  is  a  lack  of  any  significant  image  struc¬ 
ture,  and  around  occlusion  regions.  Previous  work  that 
has attempted to provide smoothly varying dense displacement
fields [Horn80, Glaz81, Nage83, Hild83] usually relies
on  the  propagation  of  displacement  information  from  image 
areas  with  significant  intensity  variations  to  homogeneous 
areas.  However,  when  occlusion  is  also  present,  such  es¬ 
timates  at  the  occlusion  boundaries  are  usually  incorrect 
due  to  occlusion  effects  and  using  them  for  initial  estimates 
tends  to  confound  the  errors. 

There have been efforts by Hannah [Hann74], Gennery [Genn80]
and Burt [Burt83] to understand the reliability of correlation
matching algorithms. Hannah observes that both the sharpness of
the correlation surface at the point of best match and the
similarity between the shape of the auto-correlation and the
cross-correlation surfaces can be used to decide about the
reliability of the match. However, she does not provide any
concrete technique for using this information. Gennery's measure
requires a model of the camera noise and scaling effects between
the images to be matched. This requires calibration of the camera
set-up. Although this appears to be a robust measure, it is often
the case that such calibration is difficult due to changes in
illumination, surface reflectance, etc. We believe that it is
possible to provide a confidence measure which does not depend on
an a-priori model of the camera and the image noise. Burt
provides a confidence measure which in many ways is similar to our
own. We describe Burt's measure in greater detail in section 3.3
and compare it with ours.

3.1  Properties  of  the  SSD  surface 

We  define  an  SSD  surface  as  the  surface  formed  by 
considering  the  Laplacian-filtered  SSD  values  corresponding 
to  different  candidate  displacements  as  the  elevation  at  that 
displacement.  This  surface  appears  to  contain  a  wealth 
of  information  about  the  nature  of  the  image  structures 
at  the  point  being  matched.  Intuitively,  it  is  clear  that 
where  there  are  significant  intensity  variations  in  the  image, 
the  match  is  likely  to  be  reliable  and  unique,  whereas  at 
points  in  a  homogeneous  area,  this  is  not  so.  This  fact  is 
noticeable  in  the  shape  of  the  SSD  surface  corresponding 
to  such  points.  Usually,  the  SSD  surface  corresponding 
to  a  point  with  distinct  image  structure  tends  to  have  a 
sharp  valley  centered  at  the  best  match  value,  whereas  at  a 
homogeneous  point  the  SSD  surface  is  rather  flat. 

We conducted an empirical study of the behaviour of the SSD
surface. For this study, we created a pair of synthetic images by
digitally "cutting and pasting" pieces from two real images,
photographed in our robotics lab [Elli84]. We then selected a
number of specific points corresponding to typical image
structures (e.g., intensity corners, homogeneous areas, occluded
areas, etc.) and studied the behaviour of the Laplacian-filtered
SSD surface at these points. The detailed results of this study
are described in [Anan84]. Figures 3.1 and 3.2 show examples of
such surfaces. Fig. 3.1 corresponds to an intensity corner which
is visible in both images. Fig. 3.2 corresponds to an intensity
corner in the first image which is occluded in the second image.
These surfaces are inverted in the displays so as to enhance
visibility. Thus the point of minimum SSD value is the peak in
these two figures. We have marked the view-angle of the surface
displays, as well as the minimum and maximum SSD values on the
surfaces. In each figure, the point of minimum SSD value is marked
with an "O" and the point corresponding to the correct
displacement is marked with an "X". Note that Fig. 3.1 shows a
distinct peak and the point of minimum SSD corresponds to the
true-match point, whereas Fig. 3.2 shows erratic behaviour and the
true-match point is away from the point of minimum SSD value.


Figure  3.1:  SSD  surface  at  a  corner 


The behaviour of the SSD surface [Anan84] suggested that the
confidence in the correctness of the match estimate should be
directly proportional to the curvature of the SSD surface, and
inversely proportional to the SSD value at the point of best
match. Consider the normalised second derivatives of the SSD
surface centered at the point of best match in the four directions
0, 45, 90 and 135 degrees. Each of these can be computed
numerically using a 1 x 3 Laplacian operator oriented in the
appropriate direction. We divide these curvatures by a weighted
sum of the three SSD values used to compute the curvature. This is
done both in order to normalise the curvature to be between 0 and
1, and to make it inversely proportional to the minimum SSD value.

In  the  formulae  given  below,  the  SSD  surface  is  con¬ 
sidered  centered  at  the  point  of  best  match.  The  indexing 
is  relative  to  that  displacement,  with  index  (0,0)  referring 
to  the  displacement  corresponding  to  the  best  match: 

\[
\begin{array}{ccc}
S(-1,-1) & S(-1,0) & S(-1,1) \\
S(0,-1)  & S(0,0)  & S(0,1)  \\
S(1,-1)  & S(1,0)  & S(1,1)
\end{array}
\]

Figure 3.2: SSD surface at an occluded corner


We  compute  the  four  normalised  directional  second 
derivatives  of  the  SSD  surface  as  follows: 


\[
C_{0} = \frac{S(0,-1) - 2\,S(0,0) + S(0,1)}{S(0,-1) + 2\,S(0,0) + S(0,1)}
\]

\[
C_{45} = \frac{S(1,-1) - 2\,S(0,0) + S(-1,1)}{S(1,-1) + 2\,S(0,0) + S(-1,1)}
\]

\[
C_{90} = \frac{S(-1,0) - 2\,S(0,0) + S(1,0)}{S(-1,0) + 2\,S(0,0) + S(1,0)}
\]

\[
C_{135} = \frac{S(-1,-1) - 2\,S(0,0) + S(1,1)}{S(-1,-1) + 2\,S(0,0) + S(1,1)}
\]

The detailed study of these surfaces [Anan84] demonstrated how
the SSD surface usually captures much of the information about the
image structures as well as occlusion effects. Where a proper
match exists (i.e., the non-occluded regions), the SSD value at
the point of best match generally seems to be low. At occlusion
areas this value is generally higher, and the selection of a
proper match becomes difficult. We also found that the curvature
of the SSD surface along different directions reflects the degree
of variation in the image along those directions, and hence the
uniqueness of the match estimate along that direction. These facts
are combined in our confidence measure described in the next
section.

3.2  The  Confidence  Measure 


where S(i,j) denotes the SSD value at position (i,j) relative
to the point of best match.

At this point various possibilities arise. Since each of these
four measures provides information specific to the corresponding
direction, they could all be separately maintained. Alternatively,
a conservative measure would be to choose the minimum of these
four measures. We have adopted this latter approach for our study.
Hence our confidence measure is

\[
\min(C_{0},\; C_{45},\; C_{90},\; C_{135})
\]
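
The computation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the NumPy array layout (rows indexed by the first argument of S, columns by the second), the optional `best` argument, and the choice to return 0 when a denominator is not positive are all assumptions made here. The sketch also assumes that the best-match point has all eight neighbours available on the SSD surface.

```python
import numpy as np


def directional_confidences(ssd, best=None):
    """Return (C0, C45, C90, C135) at the best-match point of an SSD surface.

    ssd  : 2-D array of SSD values, one per candidate displacement.
    best : optional (row, col) of the best match; defaults to the argmin.
    """
    ssd = np.asarray(ssd, dtype=float)
    if best is None:
        best = np.unravel_index(np.argmin(ssd), ssd.shape)
    i, j = int(best[0]), int(best[1])
    if not (0 < i < ssd.shape[0] - 1 and 0 < j < ssd.shape[1] - 1):
        raise ValueError("best match needs all eight neighbours on the surface")

    # 3 x 3 neighbourhood; s[1 + a, 1 + b] plays the role of S(a, b).
    s = ssd[i - 1:i + 2, j - 1:j + 2]
    centre = s[1, 1]

    def ratio(a, b):
        # 1 x 3 second difference divided by the weighted sum of the same
        # three values. A non-positive denominator is mapped to 0 here
        # (an assumption of this sketch, not something the text specifies).
        den = a + 2.0 * centre + b
        return float((a - 2.0 * centre + b) / den) if den > 0 else 0.0

    c0 = ratio(s[1, 0], s[1, 2])     # S(0,-1),  S(0,1)
    c45 = ratio(s[2, 0], s[0, 2])    # S(1,-1),  S(-1,1)
    c90 = ratio(s[0, 1], s[2, 1])    # S(-1,0),  S(1,0)
    c135 = ratio(s[0, 0], s[2, 2])   # S(-1,-1), S(1,1)
    return c0, c45, c90, c135


def confidence(ssd, best=None):
    """Conservative measure: the minimum of the four directional values."""
    return min(directional_confidences(ssd, best))
```

With these definitions a surface that is zero at the best match and large everywhere around it yields a confidence of 1, a perfectly flat surface yields 0, and a ridge-shaped surface yields 0 because the directional value along the ridge vanishes.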

3.3 The Confidence Measure of Burt, Yen and Xu

Burt, Yen and Xu [Burt83] describe a confidence measure which
is very similar in form to the one described here. Their measure
uses the Laplacian-filtered correlation surface and takes the four
directional second derivatives in a manner similar to ours. Their
normalization technique is also similar to ours. However, the two
measures differ in some significant ways.

First, our measure is based on the SSD surface, whereas their
measure is based on the correlation surface. While the SSD
function remains strictly positive, the correlation function can
also have negative values. The presence of negative values can
cause the normalized values to go below 0 as well as above 1.
Although large negative values around the point of best match
indicate a strong peak, their confidence measure in such
situations is below zero.

Second, the strategy used by Burt et al. confines the search at
each level to a 3 x 3 area centered around zero displacement. This
means that the curvature measures are taken at a point within this
window, even though the actual displacement may be large. In
band-pass filtered images, the structures can repeat themselves
beyond a certain distance, thus causing false matches with high
confidence (i.e., a unique match within the search window) to
occur.

Finally, Burt's second derivative operators are centered at
(0,0), even when the best match point occurs elsewhere within the
3 x 3 window. The 1 x 3 operator is too small to be a good
approximation of the curvature at any point away from where it is
centered. Therefore, if the displacement is not (0,0), Burt's
measure may not be a good indication of the directional curvatures
at the true match point.

3.4  A  Demonstration 


Figure  3.3:  Folgers  Image  -  first  frame 


Having defined our confidence measure, we now proceed to
demonstrate its utility with an example. For this purpose, we
chose a pair of real images, called the Folgers images. The images
are shown in Figures 3.3 and 3.4 and constitute a stereo pair.
There are two prominent surfaces at different depths, viz., the
Folgers coffee can and the textured background with the plant.
Occlusion at the left side of the can is clearly visible. This
area in the first image has no matches in the second image. We
used the Laplacian-filtered SSD as the match measure for computing
the displacement field and the confidence measure. Figure 3.5a
displays the displacement field and Figure 3.5b displays the
confidence measure. The brightness of the confidence measure is
proportional to the degree of confidence. The displacement field
has been sub-sampled for convenience of display.

These figures reveal some important facts regarding the
confidence measure. First, in large homogeneous areas of the image
the confidence measure is low and the displacement estimates are
often incorrect. For illustration purposes, we have marked one
such area as "A" in Figures 3.3, 3.5a, and 3.5b. Second, the
confidence measure is also low along straight edges, although the
component of the displacement vectors normal to the edge often
seems to be correct. One such area in the image is marked "B" in
the three figures. Third, the confidence measure is low in
occluded areas and around occlusion boundaries. One such area is
marked "C" in the three figures. Finally, although the confidence
measure is low both in homogeneous areas and occluded areas, it
does not discriminate between the two.

In order to further illustrate the correlation between the
confidence measure and the accuracy of the displacement estimates,
Figure 3.6 displays only those displacement values which have a
confidence of 0.3 or more.
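
The selection used for this display amounts to a simple threshold on the confidence map. The following sketch shows one way it might be written; the function name, the (H, W, 2) array layout, and the use of NaN to mark suppressed vectors are assumptions of this illustration rather than details given in the text.

```python
import numpy as np


def keep_confident(displacements, confidence, threshold=0.3):
    """Blank out displacement vectors whose confidence is below threshold.

    displacements : array of shape (H, W, 2), one (dy, dx) vector per pixel.
    confidence    : array of shape (H, W).
    Returns a copy in which low-confidence vectors are set to NaN.
    """
    d = np.asarray(displacements, dtype=float).copy()
    c = np.asarray(confidence, dtype=float)
    d[c < threshold] = np.nan   # both components of each rejected vector
    return d
```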


Figure 3.4: Folgers Image - second frame

Figure 3.5a: The Folgers displacement field


Figure 3.5b: The confidence measure for the Folgers pair

4.  A  MODIFIED  SEARCH  STRATEGY 

The confidence measure which we described in the previous
chapter is useful in isolating the areas in an image where the
displacement estimates are unreliable, and often incorrect. In
this chapter we describe two modifications to the search strategy
of Glazer, Reynolds and Anandan which significantly reduce the
errors in the displacement field, particularly near occlusion
boundaries.

We describe briefly the search strategy of Glazer et al. in
order to familiarise the reader with the terminology involved. The
search begins at an appropriately coarse level so that all image
displacements are less than one pixel distance at that level. For
each pixel in the first frame at the coarse level (say level l),
matches are found within a 3 x 3 window centered around the
corresponding pixel in


Figure 3.6: Folgers displacements with confidence greater than
0.3

This figure shows only the displacement vectors in Fig 3.5a which
have a confidence measure greater than 0.3.


the second frame. Each of these estimates is then projected to
the four pixels at level l + 1 of the pyramid that are directly
covered by each pixel at level l. The displacement estimates have
to be multiplied by 2 in order to take into account the reduction
in the pixel-width between levels l and l + 1. For each pixel at
level l + 1, the search is conducted in a 3 x 3 area around these
estimates. This process of projection and search is continued down
to the finest level of the image pyramid.
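
The projection step of this strategy is small enough to sketch directly. The code below is an illustrative sketch under assumed conventions (displacement fields stored as NumPy arrays of shape (H, W, 2), each level exactly twice the size of the one above it); the function name is hypothetical and the matching search itself is not shown.

```python
import numpy as np


def project_estimates(coarse):
    """Project level-l displacement estimates to level l + 1.

    coarse : array of shape (H, W, 2) holding one displacement per pixel.
    Returns an array of shape (2H, 2W, 2): every coarse pixel hands its
    estimate to the 2 x 2 block of pixels it directly covers, and the
    vectors are doubled because the pixel width halves between levels.
    """
    coarse = np.asarray(coarse, dtype=float)
    fine = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
    return 2.0 * fine
```

Each projected vector would then serve as the centre of the 3 x 3 search at the finer level.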

Both modifications concern the projection of the displacement
estimates from a coarse level to the next finer level. First, we
restrict the projection of coarse estimates to only those with
high confidence. Second, we allow the coarse level estimate at
each pixel to be projected in an area larger than the 2 x 2 area
directly covered by that pixel. Both these modifications are
explained in greater detail in the following sections.

4.1 Restricting Projection to High Confidence Estimates

The restriction of projection to only high confidence estimates
is suggested in Glazer et al. [Glaz83]. The motivation for this
idea stems from the fact that when incorrect coarse-level
estimates are projected down, the 3 x 3 searches at the finer
levels are conducted in areas of the second frame that do not
include the true-match point. This causes incorrect matches in all
the subsequent levels. If, on the other hand, these incorrect
coarse-level estimates are altogether rejected, the finer-level
searches can be conducted over a larger area than the usual 3 x 3
windows, and the true match can perhaps be recovered.

At any level l of the pyramid, the displacement updates at
pixels where the confidence measure is low can be suppressed (this
can be achieved by a simple threshold on the confidence measure).
Wherever such updates are suppressed, we simply pass the
displacement estimates from the parent pixel at level l - 1 to the
children pixels at level l + 1. Assume that at such a pixel at
level l, we are searching over a window of radius r (at level l),
centered about the displacement estimate from level l - 1. This
corresponds to a window of radius 2 x r at the next finer level.
Hence the lack of update at level l would require that we search
over a window of radius 2 x r at the children pixels at level
l + 1. If there is still no update at this level, the search
window radius should be doubled at the next level below, and so
on. This expanded search strategy is illustrated in Figure 4.1.
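
The bookkeeping implied by this rule can be sketched as follows. The function name, the tuple representation of displacements, the default threshold of 0.3, and the base radius of 1 (a 3 x 3 window) are assumptions of this sketch; the text only states that a simple threshold is used and that the radius doubles whenever an update is suppressed.

```python
def pass_down(inherited, update, conf, radius, threshold=0.3, base_radius=1):
    """Decide what a level-l pixel hands to its children at level l + 1.

    inherited : (dy, dx) estimate this pixel received from level l - 1,
                expressed in level-l pixel units.
    update    : (dy, dx) best match found by the search at level l.
    conf      : confidence of that update.
    radius    : search radius that was used at level l.
    Returns ((dy, dx), child_radius) in level-(l + 1) pixel units.
    """
    if conf >= threshold:
        # Accept the update; children search the usual small window.
        estimate, child_radius = update, base_radius
    else:
        # Suppress the update; children inherit the older estimate and
        # must cover the same area, which is twice the radius in their units.
        estimate, child_radius = inherited, 2 * radius
    return (2 * estimate[0], 2 * estimate[1]), child_radius
```

Calling the function repeatedly with low confidence doubles the radius at every level (1, 2, 4, ...), which is the expanding search described above.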

4.2  Modified  Projection  of  Coarse  Estimates 

Our second modification to the search strategy involves the
manner in which the coarse level displacement estimates are
propagated to the fine level. The strategy of Glazer et al.
projects the displacement value of a parent pixel as the estimate
for all of its four children at the next finer level. At areas
which are near discontinuities in the displacement field (e.g.,
occlusion boundaries), this approach can cause incorrect estimates
to be projected from a coarse-level pixel to the finer level
pixels. This occurs because, at a coarse level of resolution, the
boundary of discontinuity can be placed only within a coarse
accuracy. For fine-level pixels along one side of the boundary, it
is then possible that the coarse estimates from the other side of
the boundary are projected down. This causes the search at the
subsequent levels to find incorrect matches.

We propose a slightly different method of projecting the coarse
level displacement estimates to the next finer level. This idea is
based on the "overlapping" pyramid idea of Burt et al. [Burt80].
Each coarse level pixel covers a 4 x 4 area at the next finer
level, rather than the usual 2 x 2. In this manner, each pixel at
the finer level l + 1 is considered to have four potential parents
at the coarser level l (see Figure 4.2). We consider all the four
estimates as possible initial estimates for the search at level l
and conduct searches around each of these estimates. The
displacement corresponding to the best match in this expanded
search area is then chosen as the updated displacement estimate
for that pixel at level l + 1. In this way the pixels along the
boundaries of displacement discontinuities are not bound to an
incorrect coarse estimate. This allows for more precise placement
of the boundaries of displacement discontinuities at the finer
level.
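
A sketch of this overlapped projection is given below. Several details are assumptions made for illustration and are not stated in the text: the 4 x 4 footprint of coarse pixel p is taken to span fine pixels 2p - 1 through 2p + 2 in each dimension, parents falling outside the image are clamped to the border, displacements are integer tuples, and the match measure is supplied by the caller as a hypothetical `cost(y, x, dy, dx)` function (lower is better).

```python
def candidate_parents(y, x, coarse_shape):
    """Coarse-level pixels whose 4 x 4 footprint covers fine pixel (y, x)."""
    h, w = coarse_shape
    py, px = y // 2, x // 2                  # the usual direct parent
    ny = py + 1 if y % 2 else py - 1         # nearest neighbouring parent row
    nx = px + 1 if x % 2 else px - 1         # nearest neighbouring parent col
    rows = sorted({min(max(r, 0), h - 1) for r in (py, ny)})
    cols = sorted({min(max(c, 0), w - 1) for c in (px, nx)})
    return [(r, c) for r in rows for c in cols]


def best_overlapped_estimate(y, x, coarse, cost, radius=1):
    """Search around every parent's projected estimate and keep the best.

    coarse : 2-D grid (list of lists) of integer (dy, dx) tuples at level l.
    cost   : callable cost(y, x, dy, dx) giving the match measure at
             level l + 1 for displacement (dy, dx) at pixel (y, x).
    """
    shape = (len(coarse), len(coarse[0]))
    best_cost, best_disp = None, None
    for r, c in candidate_parents(y, x, shape):
        cy, cx = 2 * coarse[r][c][0], 2 * coarse[r][c][1]   # rescale by 2
        for dy in range(cy - radius, cy + radius + 1):
            for dx in range(cx - radius, cx + radius + 1):
                value = cost(y, x, dy, dx)
                if best_cost is None or value < best_cost:
                    best_cost, best_disp = value, (dy, dx)
    return best_disp
```

A fine-level pixel whose direct parent carries a wrong estimate can thus still recover the correct displacement if one of the three neighbouring parents carries the right one.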

It is obvious that we can combine both the modifications into a
uniform algorithm. This would involve choosing the appropriate
search radius for each of the estimates of the parent pixels
according to their confidence value and their search radius at
level l.

Figure 4.1: The expanded search area

Pixel (0,2) at level l has a low-confidence displacement update.
Therefore, its initial estimate from level l - 1 is passed down to
its children pixels at level l + 1. The expanded search area is
shown cross-hatched.


Figure  4.2:  The  overlapped  pyramid  projection 

The thick double lines show pixel boundaries at level l. The thin
lines show pixel boundaries at level l + 1. The projection areas
of level l pixels 1 and 4 are shown in the figure. Note that each
pixel at level l + 1 has four parents at level l.

Figure 4.3: The Poster Image - first frame


Figure  4.4:  The  Poster  Image  -  second  frame 


In Figures 4.3 through 4.8, we demonstrate the effects of
applying both these modifications to the search strategy. Fig 4.3
and 4.4 are two images from the sequence called poster images. The
displacement estimates based on the old search strategy are shown
in Fig 4.5. As in the case of our figures in Section 3, the
displacement estimates have been subsampled to enhance visibility.
The shaded areas have estimates with confidence below 0.3 and the
white areas have estimates with confidence above 0.3. Note the
predominance of the low confidence values around the occlusion
boundary. Fig 4.6 displays the displacement estimates generated by
a search strategy which incorporates only the first of the two
modifications, viz., suppressing low confidence estimates. Again,
the shaded areas are areas of low confidence (below 0.3). We make
the following observations about Figures 4.5 and 4.6. First, the
low-confidence areas


Figure 4.6: Results using restricted projection


are rectangular in shape. This is due to the strict 1 to 4
projection used in these search strategies. In both cases, if an
error is made at a coarse level that is not suppressed, it is
possible that the search areas at the finer levels for all the
children pixels may no longer include the true match points. Since
no information is used from the correct neighbours, these go
uncorrected. Second, the restriction of projection of coarse
estimates (Fig. 4.6) seems to improve the displacement estimates
in some areas (of the background), whereas it introduces more
errors at others (near the occlusion boundary). This is because
some of the low-confidence coarse estimates are actually correct.
Eliminating these and

Figure 4.7: Results using overlapped projection


expanding  the  search  area  can  lead  to  false  matches  due  to 
repeated  features. 

Fig 4.7 displays the displacement estimates superimposed on low
confidence areas as provided by a search strategy which
incorporates only the second modification, viz., the overlapped
pyramid projection. Fig 4.8 displays the displacement estimates
superimposed on the shaded low confidence areas as provided by a
search strategy that incorporates both the modifications mentioned
above. In Figures 4.7 and 4.8, it is easy to see that a dramatic
reduction in the size of the low-confidence areas has been
achieved. Figure 4.7 shows that the modified projection strategy
provides the major contribution to the improvement in the
displacement field. This is so because, in this approach, the
coarse estimates are not altogether eliminated. Instead,
information is used from neighbours who may have correct
estimates.

Later  in  this  paper,  we  discuss  other  possible  ways  of 
utilizing  the  confidence  measure  to  improve  the  matching 
results,  all  of  which  are  currently  under  investigation. 

5. APPLICATIONS AND FUTURE WORK

Thus far in this report we have described our confidence
measure, and demonstrated its use in our modification to the
search strategy. This measure can also serve as a useful piece of
information for techniques that process the flow field. In the
following sections, we outline the immediate applications of the
confidence measure, and describe the directions in which this
study can be extended to include the various modifications
necessary for other applications.

5.1 Use by Parameter Computation Algorithms

Figure 4.8: Results using both our modifications

One of the primary uses of image displacement fields has been to
recover camera and object motion parameters and thereby the depth
and the orientation of the image surfaces. Typically, techniques
that address these problems involve solving a system of equations
[Long81, Tsai84] or minimizing an error measure [Adiv84, Rieg84,
Praz80, Lawt84].

These techniques can use the confidence measure in two ways.
The first method is to eliminate the displacement estimates with
low confidence measures. The second method is based on the
observation that some of these techniques compute a global error
or transform which has contributions from each displacement
vector. This contribution can be weighted by the confidence
measures. In this way less accurate displacement vectors (which
typically have lower confidence) contribute less to the
optimization process, thus enhancing the reliability of that
process. As an example of this latter use, we point to Adiv
[Adiv84], who attempts to segment the image into regions which
have consistent displacement fields within those regions in order
to compute the 3-d motion parameters corresponding to these
fields. His technique is a multi-stage one, and involves
transformations from the displacement vectors to
affine-transformation parameter space, as well as
least-square-error fits of 3-d transformations to the displacement
fields. For both these purposes, he uses the confidence measure to
weight the contribution of a displacement vector to this
transformation and the error measure.
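
As a generic illustration of the second use, the sketch below fits an affine displacement model by weighted least squares, with each vector's squared residual weighted by its confidence. This is not Adiv's algorithm; the function name, the affine model, and the array layouts are assumptions chosen only to show how confidence weights might enter a least-square-error fit.

```python
import numpy as np


def weighted_affine_fit(points, displacements, confidence):
    """Fit d(x, y) = p0 + p1 * x + p2 * y per component, weighted by confidence.

    points        : (N, 2) array of (x, y) pixel positions.
    displacements : (N, 2) array of measured displacement vectors.
    confidence    : (N,) array of non-negative weights.
    Returns a (3, 2) parameter array, one column per displacement component.
    """
    pts = np.asarray(points, dtype=float)
    d = np.asarray(displacements, dtype=float)
    conf = np.clip(np.asarray(confidence, dtype=float), 0.0, None)
    # Scaling rows by sqrt(w) makes ordinary least squares minimise
    # sum_i w_i * ||residual_i||^2.
    w = np.sqrt(conf)[:, None]
    design = np.column_stack([np.ones(len(pts)), pts])
    params, *_ = np.linalg.lstsq(w * design, w * d, rcond=None)
    return params
```

A vector with zero confidence drops out of the fit entirely, which reproduces the first use (outright elimination) as a special case.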

5.2  Future  Work 

Directional  Information  about  Matching 

The demonstrations of the behaviour of the SSD surface at
typical areas of the image with directional structures (e.g.,
edges) clearly showed that such directional information was indeed
noticeable in the shape of the SSD surfaces. More specifically, we
noted earlier that along edges in the images, we see a ridge-like
SSD surface where the orientation of the ridge corresponds to the
orientation of the edge in the image. Hence, the directional
confidence measure along the direction of the edge is low, whereas
it is high in the direction perpendicular to it. It has also been
well recognized by many researchers (e.g., [Glaz81, Horn80]) that
a directional feature in the image (say, an edge) can provide
reliable information about the component of the displacements in
the direction perpendicular to that feature, whereas it can
provide no information about the component of the displacements
along the direction of that feature.

There  are  two  possible  ways  in  which  this  directional 
confidence  information  can  be  utilized.  The  first  is  to  use 
it  in  an  algorithm  similar  to  the  modified  search  strategy 
described  in  section  4.  In  this  case,  the  search  area  would 
not  be  expanded  along  the  direction  where  the  SSD  surface 
shows  significant  variations.  Instead,  we  can  expand  the 
search  area  in  the  direction  along  the  ridge,  thus  obtaining 
a  somewhat  rectangular  search  area. 

The second method is to use these in an algorithm that
propagates information between neighbouring pixels, especially
along edges and curves in the image, in order to bring together
reliable information about different directions. One way of doing
this is described in [Glaz81]. Each pixel provides a linear
constraint equation on the displacement vector at that pixel. In
Glazer's approach, a least-square-error solution for the system of
the constraint equations from neighbouring pixels along an edge is
considered to be the true displacement vector for all pixels along
the edge. Given that the SSD surface captures the directional
information, we have available to us information similar to these
constraint equations. An added benefit of using the SSD surface
over the constraint equations is the fact that the SSD surfaces
also provide information about noise variations and occlusion
effects.

A more general way of using neighbour information for the
improvement of the displacement estimates is to use
relaxation-smoothing techniques. These are similar to the now
classical Horn and Schunck smoothing technique [Horn80], or more
closely to the constrained-smoothing approach described by Nagel
[Nage83]. Nagel's approach integrates the intensity gradient
constraint of Horn and Schunck with a smoothness constraint on the
flow field which is somewhat different from theirs. The smoothness
constraint usually involves the minimization of some function of
the spatial derivatives of the displacement field. Nagel modifies
the function chosen by Horn and Schunck with weights that are
inversely proportional to the intensity gradients at a given
pixel. At each pixel, the component of the displacement vector
along the direction of the intensity gradient is not allowed to
vary significantly, whereas the component in the normal direction
is allowed to change more freely.


Our study of the Laplacian-filtered SSD surface suggests that
we can weight the function to be minimized using information from
the SSD surface. Since Nagel's smoothness constraint uses image
gradient information from only one image, it is not sensitive to
effects of noise and illumination in both images and to occlusion
phenomena. The SSD surface includes such information and is
therefore likely to prove more useful, particularly in scenes
containing occlusion. One of our future goals is to pursue this
approach and compare it with the approach of Nagel.

Recognition  and  Processing  of  Occlusion 

Occlusion, although a source of failure and frustration for
most algorithms that attempt to produce dense displacement fields,
is very useful for the purposes of segmentation of the image into
objects at different depths or with different movements.
Therefore, any process that detects occlusion very early in the
processing can be useful for focus of attention, tracking, and
more accurate computation of the various image properties.

The confidence measure discussed in this study produces low
values for both homogeneous areas as well as occlusion areas.
However, often it may be possible to separate these situations
using the information in the SSD surface. In real world images, it
is usually the case that where there are occlusion boundaries,
there are also discontinuities in the image texture. This suggests
that all values on the cross-SSD surface will usually be high at
occlusion boundaries, whereas they will be uniformly low at
homogeneous areas. This observation can be useful in the
identification of occlusion areas in an image. These issues will
be the focus of our own future work in this direction.
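
The observation above suggests a simple heuristic, sketched here purely as an illustration of the idea and not as a tested method. The function name, the labels, and both thresholds are hypothetical; in particular the SSD threshold would depend on window size and image contrast and would have to be chosen empirically.

```python
import numpy as np


def label_low_confidence(ssd, conf, conf_threshold=0.3, ssd_threshold=100.0):
    """Guess why a match is unreliable from the level of its SSD surface.

    ssd  : 2-D array of SSD values over the candidate displacements.
    conf : confidence value already computed for the best match.
    """
    if conf >= conf_threshold:
        return "reliable"
    # Every candidate matches badly: consistent with an occluded point.
    if float(np.min(ssd)) > ssd_threshold:
        return "occlusion"
    # Low values with little curvature: consistent with a homogeneous area.
    return "homogeneous"
```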

Abstract

In this paper we discuss the use of map descriptions to guide the
extraction of man-made and natural features from aerial imagery. An
approach to image analysis using a region-based segmentation system
is described. This segmentation system has been used to search a
database of images that are in correspondence with a geodetic map to
find occurrences of known buildings, roads, and natural features. The
map predicts the approximate appearance and position of a feature in
an image. The map also predicts the area of uncertainty caused by
errors in the image to map correspondence. The segmentation process
then searches for image regions that satisfy 2-dimensional shape and
intensity criteria. If no initial region is found, the process attempts to
merge together those regions that may satisfy these criteria. Several
detailed examples of the segmentation process are given.

1. Introduction

This paper describes MACHINESEG, a program that performs map-guided
image segmentation. It uses map knowledge to control and guide the
extraction of man-made and natural features from aerial imagery using
region-growing techniques. We use the CONCEPTMAP database [1] from the
MAPS system [2] as our source of map knowledge. In the CONCEPTMAP
database, map knowledge is represented as three-dimensional
descriptions of man-made features, natural features, and conceptual
features. Examples of man-made features are buildings, roads, and
bridges; natural features are rivers, lakes, and forests; and
conceptual features are political boundaries, residential
neighborhoods, and business areas. These feature positions are
represented in the map database in terms of <latitude, longitude,
elevation>. In this paper we will discuss the extraction of man-made
and natural features.

In this paper we refer to regions as areas of more-or-less uniform
pixel intensity which may or may not have interpretations as
real-world surfaces or objects. Features are regions that have been
recognized and interpreted by a program, or have been outlined by a
human, or can be characterized by a simple set of position, shape,
size, and spectral properties. Image segmentation is the process of
generating candidate regions for the feature extraction process.
Feature extraction is the recognition of a region with particular
properties by application of one or more tests. The result of feature
extraction is the generation of a set of regions in the image which
satisfy feature extraction criteria specified by a user or high-level
process. In MACHINESEG we have combined segmentation and labeling so
that the segmentation process presents candidates for evaluation to
the labeling process. Once a region is identified as a feature,
special evaluation procedures are used.

2.  Map-Guided  Segmentation 

The  notion  of  map-guided  image  segmentation  is  not  a  new  one. 
Many  researchers  have  discussed  the  use  of  a  priori  knowledge  of 
various  object  features  such  as  size,  shape,  orientation,  and  color  to 
extract  and  identify  features  from  an  image.  However,  there  are  few,  if 
any,  examples  of  systems  that  can  systematically  search  through  a 
database  of  images  looking  for  examples  of  particular  objects  or  classes 
of  objects.  In  this  paper  we  present  one  such  system  which  uses 
constraints  derived  from  a  map  database  to  perform  segmentation  in 
aerial  imagery. 

It is important to characterize what we mean by "map-guided"
image segmentation.

Map-guided  image  segmentation  is  the  application  of 
task-independent  spatial  knowledge  to  the  analysis  of  a 
particular image using an explicit map-to-image
correspondence  derived  from  camera  and  terrain  models. 

Map-guided segmentation is not interactive editing or computation of
descriptions in the image domain, since these descriptions are valid
only for one specific image. In MAPS there is a geodetic description
(<latitude, longitude, elevation>) for each map entity in the
CONCEPTMAP database. This description is in terms of points, lines,
and polygons, or collections of these primitives. Features such as
buildings, bridges, and roads have additional attributes describing
their elevation above the local terrain, as well as their composition
and appearance. The location of each map feature in the database can
be projected onto a new image using a map-to-image correspondence
maintained by MAPS. Likewise, a new map feature can be projected onto
the existing image database. If camera model errors are known, one can
directly calculate an uncertainty for image search windows. Further,
as new features are acquired their positions can be directly
integrated into the map database.

Figure 1 gives a schematic description of the map-guided feature
extraction process in MAPS. There are two basic methods for applying
map knowledge to the extraction of features from aerial imagery. The
first method uses generic knowledge about the shape, composition, and
spectral properties of man-made and natural features. The second uses
map-based template descriptions. These descriptions are stored in the
CONCEPTMAP database and represent knowledge about known buildings,
roads, bridges, etc. This knowledge includes geodetic position, shape,
elevation, composition, and spectral properties. In the second case,
the position, orientation, and scale are constrained, whereas in the
first, only the scale can be determined. In both cases, in order to
operationalize spatial knowledge for the analysis of a particular
image, a map-to-image correspondence is performed. In this paper we
discuss the implementation and performance of a region-based
segmentation/feature extraction program which has been integrated to
use these map constraints to guide segmentation.

Application  areas  for  this  approach  include  digital  mapping,  remote 
sensing, and situation assessment. More specifically, tasks that can
capitalize on a priori map knowledge to constrain 'where to look' and
'what to look for' may provide sufficient context for inherently weak
methods  of  feature  extraction  to  be  effective.  Rather  than  looking  for 
"perfect"  segmentations,  our  approach  extracts  segments  characterized 
as  "islands  of  reliability"  for  some  particular  instance  or  class  of  object. 
These  local  regions  can  be  further  analyzed  by  modules  that  bring  to 
bear  more  task-specific  or  object-specific  knowledge  to  confirm  or 
refute  the  initial  analysis. 

In Section 3 we discuss the organization of MACHINESEG and
describe various constraints that can be applied during region-growing.
In Section 4 several examples illustrate the capabilities of map-guided
segmentation using map-based template descriptions and generic
feature descriptions.

Figure 1: Map-Guided Feature Extraction

3. MACHINESEG: Region-Based Segmentation Using Constraints

In this section we describe the implementation and control of a
region-based image segmentation program that utilizes shape and
spectral constraints to control the merging and selection of primitive
regions. These constraints can be specified by an interactive user, or
by another program.

3.1.  Region  Growing  as  Segmentation 

Region  growing  is  a  well  understood  technique  in  image  processing 
and  computer  vision  for  providing  a  region-based  segmentation  of  an 
image. Ballard and Brown [3] provide a good introduction and
Yakimovsky [4] gives a detailed treatment.

One  problem  with  region  growing  is  that  it  is  difficult  to  know  when 
to  stop  merging  regions  together.  Standard  techniques  involve 
thresholds  on  edge  strength  between  regions  and/or  on  merge 
compatibility  based  on  spectral  similarity.  These  thresholds  may  be 
difficult  to  select,  especially  if  one  requires  robust  behavior  over  many 
different  images  at  a  variety  of  resolutions.  If  we  merge  until  there  is 
only one region left, we have obviously gone too far. If we stop the
process  based  on  counting  merge  iterations  or  other  ad  hoc 
considerations,  some  regions  may  have  merged  into  a  stable  state, 
while  others  will  still  be  in  several  pieces  or  may  have  already  been 
merged  into  the  background.  This  problem  is  caused  by  the  fact  that 
semantically  meaningful  features  will  not  necessarily  have  good  edges. 
The  underlying  assumption  is  that  regions  of  (nearly)  homogeneous 
intensity  in  the  image  correspond  to  objects  or  surfaces  which  are 
physically  realized  in  the  scene.  However,  as  we  well  know,  some 
regions  will  have  weak  edges,  because  they  do  not  differ  much  from 
the  background,  and  edges  can  exist  where  there  is  really  no  boundary 
between  objects.  Shadows,  highlights  and  occlusions  also  violate  this 
assumption  and  complicate  the  processing  of  aerial  imagery. 

We therefore introduce a method for stopping the merging of
specific regions rather than trying to determine when the entire image
segmentation should be terminated. The approach we have taken is
different from classical segmentation in that we do not necessarily
break up the image into disjoint regions so that each pixel is part of
some region. Rather, we have developed a method for finding features
that meet some specific criteria. By changing the selection criteria it
is possible to assign more than one region interpretation to a pixel in
different executions of the region-merging process. Criteria can be
used to look for features whose exact shape is unknown but can still be
characterized generically. For example, if we want to find roads we
can look for linear features. If we are segmenting an airport scene, we
may find large grassy areas between runways by looking for blob
features of some minimum size. To find more complex shapes, such as
specific buildings, we use template matching. In the following section
we discuss the merging algorithm that allows us to search for linear
regions, compact regions, blob regions, or to match regions to a shape
template.

3.2.  The  Algorithm 

The  basic  procedure  for  producing  regions  which  satisfy  a  particular 
set  of  criteria  is  as  follows: 

1. An interactive user or program invokes MACHINESEG with an
image name and sub-image area.

2. The sub-image area is smoothed using edge-preserving
smoothing [5].

3. Edge extraction is performed, and seed regions (primitive
8-connected regions of homogeneous image intensity) are
produced.

4. A "state" file containing the names of the intermediate
images and the edge and region data structures is saved. We
make use of this file to restore the initial state of primitive
regions when changing criteria. This is discussed in Section
3.5.

5. Match criteria are selected using map-based generic or
template descriptions as described in Figure 1.

6. Regions are merged based on the strength of the edges
between regions. Resulting merged regions are evaluated
and marked for special handling if they satisfy the criteria.

We store a list of the edges between regions, sorted by the strength
of the edge. We simply scan down the list of edges, starting with the
weakest, and merge the two regions that share that edge. Each time a
new region is created by merging two other regions, the new region is
"scored" against the specified set of area, intensity, and shape criteria
to determine if it is similar to the prototype region we are looking for.
If it is similar enough, we mark the merged region. Users can specify
that after a region has been marked, it will not be merged any further
unless the resulting region would improve the criteria "score" or,
alternatively, unless it would simply meet the criteria. Meeting the
criteria allows newly merged regions to get locally worse scores in
order to permit future merge operations. If the resulting merge must
improve the criteria, the region score must monotonically increase
with each merge. Various high-level strategies may select the
appropriate evaluation method. For example, in template matching, we
require that merges improve the score since this helps prevent small
appendages from being merged with the feature. In looking for linear
features, we perform any merge as long as the resulting feature would
still meet the criteria.
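
The merge loop described above can be sketched as follows. This is an
illustrative reconstruction, not the MACHINESEG implementation: the
function name, the union-find bookkeeping, the representation of a
region as a frozenset of seed-region indices, and the assumption that
edge strengths stay fixed during the scan are all ours. The `score`
callable and `threshold` stand in for the user-specified criteria.

```python
def merge_regions(num_seeds, edges, score, threshold, must_improve):
    """Scan edges weakest-first, merging and marking regions.

    edges:        list of (strength, seed_a, seed_b) tuples.
    score:        callable mapping a frozenset of seed ids to a number.
    threshold:    minimum score for a region to be marked as a feature.
    must_improve: True  -> a marked region merges only if the score rises;
                  False -> it merges if the result still meets the criteria.
    Returns {region (frozenset of seed ids): score} for marked regions.
    """
    parent = list(range(num_seeds))
    members = {i: frozenset([i]) for i in range(num_seeds)}
    marked = {}  # root id -> score of the marked region

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for _strength, a, b in sorted(edges):  # weakest edge first
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # both seeds already belong to the same region
        merged = members[ra] | members[rb]
        new_score = score(merged)
        old_scores = [marked[r] for r in (ra, rb) if r in marked]
        if old_scores:
            # A marked region is protected: apply the selected policy.
            if must_improve:
                allowed = new_score > max(old_scores)
            else:
                allowed = new_score >= threshold
            if not allowed:
                continue
        parent[rb] = ra
        members[ra] = merged
        del members[rb]
        marked.pop(ra, None)
        marked.pop(rb, None)
        if new_score >= threshold:
            marked[ra] = new_score
    return {members[root]: s for root, s in marked.items()}
```

With `must_improve=True` the sketch behaves like the template-matching
policy, in which appendages that lower the score are rejected; with
`must_improve=False` it behaves like the policy described for linear
features.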

The underlying idea behind our region merging scheme is that, if a
feature exists with the characteristics we are looking for, a significant
portion of that feature will eventually be merged into a single region.
We then stop the merging of that feature with other regions if the
merge would not maintain or improve the region model. We make the
usual assumption that the features we are trying to extract will have
good edges. As long as the edges between the object and the
background material are stronger than the edges between the
subregions of the object itself, this method will work reasonably well.
If the edges between the background and the region are weak (i.e., the
average intensities do not differ by much) this technique will not
perform better than classical techniques. Figure 2 shows the region
merge evaluation loop performed by MACHINESEG.

3.3.  Evaluation  Criteria 

Different  criteria  can  be  set  by  a  user  or  by  some  high-level  process, 
such  as  the  map  database,  to  determine  what  types  of  regions  to  search 
for.  We  can  look  for  regions  within  a  certain  range  of  average 
intensity,  area,  compactness,  linearity,  or  by  matching  regions  with  a 
specific  2D  shape.  For  example,  when  searching  for  tarmac  regions  in 
airport  scenes  we  use  combinations  of  these  criteria.  In  the  following 
sections  we  discuss  the  currently  implemented  selection  criteria. 


3.3.1.  Average  Intensity 

Average intensity of the region is the weakest constraint. We have
mainly applied this criterion to identify possible shadow regions, using
an analysis of the image intensity histogram to specify the criterion. To
select regions of a specific intensity, the user/process specifies a range
in which the average intensity of the region must fall.

3.3.2.  Area 

Area is simply the number of pixels in the region. The area measure
can be used by itself to find background areas which often appear as
large, homogeneous regions immediately after the seed regions are
grown. The area criterion is most often combined with other metrics
such as linearity and compactness to avoid computation on regions that
are too small or large. To select regions of a specific area, the
user/process specifies a range of acceptable region areas.

3.3.3.  Compactness 

The compactness of a region is defined by

\[ \mathit{compact} = \frac{4\pi \times \mathit{area}}{\mathit{perimeter}^2} \]

By using a low compactness, we can find blob features, or features that
are roughly circular. Features with high compactness are candidates
for man-made structures. To limit regions using compactness, the
user/process specifies a compactness range which is acceptable.
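
As a worked illustration of the formula, the sketch below evaluates it
for a few ideal shapes. The function names are hypothetical and the
shapes are treated as continuous figures; pixel-based area and
perimeter estimates would give somewhat different values. Under this
formula an ideal circle evaluates to 1.0, and elongated or ragged
outlines evaluate to smaller values.

```python
import math

def compactness(area, perimeter):
    """Return 4*pi*area / perimeter**2 for a region."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2

def in_range(value, low, high):
    """True if value lies in the user-specified acceptable range."""
    return low <= value <= high

# Ideal shapes (continuous geometry, an assumption for illustration):
circle = compactness(math.pi * 10.0 ** 2, 2.0 * math.pi * 10.0)  # radius 10
square = compactness(10.0 * 10.0, 4 * 10.0)                      # side 10
strip = compactness(100.0 * 1.0, 2 * (100.0 + 1.0))              # 100 by 1
```

Here `square` equals pi/4 (about 0.785) and `strip` is about 0.03.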
3.3.4. Linearity

The linearity measure is a heuristic designed to give high values for
long, narrow features and lower values for other shapes. For
rectangular features, the linearity is approximately equal to the
length-to-width ratio, independent of orientation. Thus, this measure
not only detects linear features, but also gives some measure of how
linear the feature is. To use the linearity criterion, a user specifies a
minimum linearity. Regions with a linearity greater than or equal to
the value specified are then classified as being linear.


We use the length and width of the bounding box of the region, its
area, and its perimeter to compute the linearity measure. If the region
is a narrow rectangle, it will lie diagonally in its bounding box and
its length will be approximately

\[ \mathit{length} = \sqrt{\mathit{MBRH}^2 + \mathit{MBRW}^2} \]

where MBRH and MBRW are the height and width of the bounding
box. Still assuming the region is a rectangle, we can compute its width
as

\[ \mathit{width} = \frac{\mathit{area}}{\mathit{length}} \]

The length-to-width ratio is therefore

\[ \frac{\mathit{length}}{\mathit{width}}
   = \frac{\mathit{length}}{\mathit{area}/\mathit{length}}
   = \frac{\mathit{length}^2}{\mathit{area}}
   = \frac{\mathit{MBRH}^2 + \mathit{MBRW}^2}{\mathit{area}} \]

We use either this expression or its reciprocal, whichever is larger.
This formula will give the length-to-width ratio for regions that are
rectangular. However, for regions that are not rectangular, the result
in this form is meaningless. By adding a further dependence on
perimeter, we can reduce the score for regions that have appendages.
Since the formula is designed to give high values for rectangles, a
perimeter value different from that which would be expected for a
rectangle should decrease the score. That is, we will add a dependence
on perimeter in such a manner as to decrease the value of this formula
for non-rectangular regions. The desired effect can be achieved by
multiplying by a correction factor.

\[ \mathit{correction\ factor}
   = \frac{2 \times (\mathit{width} + \mathit{length})}{\mathit{perimeter}} \]

Note that this is a unitless quantity. The value of this expression will
be approximately 1.0 for a rectangle but will decrease with
imperfections. The expression will not be exactly 1.0 since we use
approximate length and width as computed above. For some shapes
(circles, for example) the value of this expression can be greater than
1.0. Since this can only occur when the region is fairly compact, and
compact regions are not linear, we multiply by the reciprocal of this
expression if it is greater than 1.0. The square of this expression seems
to give better results in practice since this further increases the
sensitivity to imperfections; this also eliminates the need to compute
square roots when the entire expression for linearity is expanded.
Thus, for regions that are approximately rectangular, we compute the
length-to-width ratio. For other regions, the score computed for
linearity is relatively low.
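
The derivation above can be collected into a single function. The
sketch below is our reading of the text rather than the original code:
the function and parameter names are hypothetical, and the square root
is kept for clarity instead of being expanded away as the text
suggests.

```python
import math

def linearity(mbr_height, mbr_width, area, perimeter):
    """Length-to-width ratio times the squared perimeter correction.

    mbr_height, mbr_width: sides of the region's bounding box.
    area, perimeter:       area and perimeter of the region itself.
    """
    if area <= 0 or perimeter <= 0:
        raise ValueError("area and perimeter must be positive")
    length_sq = mbr_height ** 2 + mbr_width ** 2
    if length_sq <= 0:
        raise ValueError("bounding box must be non-empty")
    ratio = length_sq / area
    ratio = max(ratio, 1.0 / ratio)       # use whichever is larger
    length = math.sqrt(length_sq)         # approximate region length
    width = area / length                 # approximate region width
    correction = 2.0 * (width + length) / perimeter
    if correction > 1.0:                  # occurs for compact shapes
        correction = 1.0 / correction
    return ratio * correction ** 2
```

For an ideal 100 by 2 strip (bounding box 2 by 100, area 200,
perimeter 204) this evaluates to roughly 50, close to the true
length-to-width ratio, while a 10 by 10 square scores below 2. Raising
the perimeter for the same bounding box and area, as an appendage or
ragged outline would, lowers the score.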


3.3.5.  Template  Matching 

Template matching can be used to look for a region having a specific
shape. The measure computed is the percentage of overlap between
the region being measured and the template shape. The template
shape, given in polygon form, and the minimum percentage of overlap
must be specified. The shape may be specified either interactively or
from a stored database file. The template shape is scan-converted into
a matrix to simplify the shape comparison process. Scan-conversion of
the regions is not necessary since they are stored in image format as a
part of the region growing process. To compute overlap, we find the
centroids of both regions and shift the region matrix so that the
centroids line up. Overlap is defined as the total number of pixels
matched from both regions (i.e., twice the number of overlapping
pixels), divided by the sum of the areas of the two regions.


\[ \mathit{overlap} = \frac{2 \times \mathit{intersection}}{\mathit{area}_1 + \mathit{area}_2} \]

where intersection is the area of the intersection and area1 and area2
are the areas of the two regions under comparison. This gives an
overlap of 1.0 for identical regions. It also gives low overlaps for
regions whose size is very different, even if one of the regions is wholly
contained in the other. For regions of the same size, it will give scores
in proportion to the area of intersection.
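
A minimal sketch of the overlap measure follows. It represents each
shape as a set of (row, column) pixel coordinates rather than as the
scan-converted matrix used in the text, and it rounds the centroid
shift to whole pixels; both choices, and the function names, are
assumptions made for illustration.

```python
def centroid(pixels):
    """Mean (row, col) of a non-empty set of pixel coordinates."""
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)

def overlap(region, template):
    """2 * |intersection| / (area1 + area2) after aligning centroids."""
    region_r, region_c = centroid(region)
    templ_r, templ_c = centroid(template)
    # Whole-pixel shift that moves the region centroid onto the template's.
    dr, dc = round(templ_r - region_r), round(templ_c - region_c)
    shifted = {(r + dr, c + dc) for r, c in region}
    return 2 * len(shifted & template) / (len(region) + len(template))

square = {(r, c) for r in range(10) for c in range(10)}
moved = {(r + 7, c - 3) for r, c in square}          # same shape, displaced
small = {(r, c) for r in range(5) for c in range(5)}  # contained, 1/4 the area
```

`overlap(moved, square)` is 1.0 because the centroid shift undoes the
displacement, while `overlap(small, square)` is 0.4 even though the
small shape fits entirely inside the large one, which illustrates the
penalty for size mismatch noted above.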


This comparison can be performed at an arbitrary number of
orientations spaced at equal intervals; in some cases (e.g., template
criteria) we know the orientation approximately and need only one
orientation. In other cases, we may have a good model of the expected
shape, but have weak constraints on its orientation. For multiple
orientations, we compute the overlap for all orientations and use the
maximum value. This comparison is obviously computationally
expensive, but many regions may be excluded from this operation
simply because their area is too large or too small for a match of the
desired percentage to be possible. For example, if we are looking for
an 80 percent overlap with a feature containing 100 pixels, we only
need to perform the overlap computation for regions with areas
between 67 and 150 pixels. The performance of the overlap
computation could be improved using alternative formats such as
run-coding, or a variation of chamfer matching. However, in the first
case this would require additional storage and the computation of new
run-codes for merged regions. Additionally, our method allows for
holes within the template region, which would complicate the
straightforward run-code algorithm as well as chamfer matching as
implemented by a grassfire algorithm.
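
The area bounds in the example follow from the overlap definition. The
intersection can be no larger than the smaller of the two areas, so for
a template of area t, a region of area a, and a required overlap
fraction p, a match is possible only if 2 min(a, t) / (a + t) >= p,
which gives t p / (2 - p) <= a <= t (2 - p) / p. The sketch below
computes these bounds in integer arithmetic; the function name and the
use of an integer percentage are our own choices.

```python
def candidate_area_range(template_area, min_overlap_percent):
    """Smallest and largest region areas that could reach the overlap.

    template_area:       template area in pixels (integer).
    min_overlap_percent: required overlap as an integer percentage.
    """
    p = min_overlap_percent
    if not 0 < p <= 100:
        raise ValueError("percentage must be in (0, 100]")
    lower = -(-template_area * p // (200 - p))  # ceiling of t*p/(200-p)
    upper = template_area * (200 - p) // p      # floor of t*(200-p)/p
    return lower, upper
```

For a 100-pixel template and an 80 percent overlap this returns
(67, 150), matching the figures quoted above.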

3.4.  Limiting  the  Search  Area 

In addition to providing the ability to look for regions of specific
shape, other actions of the region grower can be controlled by higher
level processes. The region merging can be limited to specific image
sub-areas to improve efficiency. This might be done by a high level
process whose goal was to complete the merging in a certain area to
determine if a feature was present. This may be useful in analysis of
other areas of the image if some specific information is known about
the scene being segmented. For sub-area merging, the edge list is
traversed as usual, but merging of regions is disallowed if neither
region is wholly contained in the sub-area. Since the region merging is
expensive, limiting the search area can achieve significant speed-up.
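
The sub-area rule can be expressed as a small predicate. The sketch
assumes a rectangular sub-area and represents each region by its
bounding box as (top, left, bottom, right) in inclusive pixel
coordinates; for a rectangular window, a region is wholly contained
exactly when its bounding box is. These representations and names are
assumptions, not details given in the text.

```python
def box_inside(box, window):
    """True if box lies wholly within window (inclusive coordinates)."""
    top, left, bottom, right = box
    w_top, w_left, w_bottom, w_right = window
    return (w_top <= top and w_left <= left
            and bottom <= w_bottom and right <= w_right)

def merge_allowed(box_a, box_b, window):
    """A merge is disallowed only if neither region is inside the window."""
    return box_inside(box_a, window) or box_inside(box_b, window)
```

Note that under this rule a region inside the sub-area may still merge
with a neighbor that extends beyond it.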

3.5.  Suspension  of  Merging 

Another  form  of  high-level  control  is  the  ability  to  stop  region 
merging  at  a  specified  point  and  return  control  to  a  higher  level.  This 
can  be  done  when: 

•  Some  number  of  regions  that  fit  the  feature  criteria  are 
found. 

•  A  particular  marked  feature  is  updated  as  having  been 
extended. 

•  A  certain  number  of  merge  cycles  have  been  performed. 

This gives a high-level program a fine grain of control over the
segmentation process as well as the ability to modify the criteria or
search in a small area. After analysis of the results of an initial region
merge, criteria can be relaxed or made more restrictive, based on the
goals of the segmenter. Merging may be restarted from the initial seed
regions, or resumed from the point of suspension. This flexibility
allows us to implement high-level strategies such as best-first or
bottom-up propagation of weak hypotheses. Similar control over
parameters by evaluation procedures is described by Selfridge [6].
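
One way a controlling program might express the three suspension
conditions listed above is sketched below. The class, its field names,
and the way the merge loop reports its progress are all hypothetical;
the text does not describe the actual interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuspendPolicy:
    """Conditions under which merging returns control to the caller."""
    max_marked: Optional[int] = None     # stop after this many features
    watch_feature: Optional[int] = None  # stop when this feature is extended
    max_cycles: Optional[int] = None     # stop after this many merge cycles

    def should_suspend(self, marked_count, extended_feature, cycles):
        """Check the conditions after one merge cycle.

        marked_count:     number of regions currently marked as features.
        extended_feature: id of the marked feature just extended, or None.
        cycles:           number of merge cycles performed so far.
        """
        if self.max_marked is not None and marked_count >= self.max_marked:
            return True
        if (self.watch_feature is not None
                and extended_feature == self.watch_feature):
            return True
        if self.max_cycles is not None and cycles >= self.max_cycles:
            return True
        return False
```

Leaving a field as None disables that condition, so a caller can
combine the three conditions freely.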

4.  Some  Examples 

The following examples illustrate map-guided segmentation using
the MACHINESEG program. The first three examples show the
extraction of buildings and natural features from images of the
Washington, D.C. area using the CONCEPTMAP database. The final two
examples show the use of map-derived size and shape criteria to find
instances of generic objects in an image of National Airport.

4.1. Map Guided Template Matching

The following three examples were generated using the
CONCEPTMAP program. This program allows a user to specify a feature
in the database and an image in which to look for the feature. The
program then creates a template feature using the map description in
the database and the map correspondence for the image given. The
template feature description determines both the area to segment in
and the shape to look for. CONCEPTMAP invokes the MACHINESEG
program to find a feature of a specific shape within a small context
area of the image. The regions shown in Figures 3, 4, and 5 were
extracted using a match score of 0.8 (eighty percent). The context area
was approximately twice the size of the predicted feature. Using a
small area helps to reduce false alarms from similarly shaped features
in the same area. This is usually only a problem in lower resolution
images.

Figure 3 shows the results of processing for the feature Kennedy
Center in five different images. Image patches labeled DC1013 and
DC1109 are digitized from aerial photographs taken at scale 1:60000,
DC1420 was taken at scale 1:36000, and DC38618 and DC38617* were
taken at scale 1:12000. At these scales, one pixel is about equal to 5, 3,
and 1 meters square, respectively. The image labeled DC38617* had
been segmented by hand to create the database feature used for
matching. In the lower resolution images, the contrast is rather poor,
but large portions of the feature were still detected. In the higher
resolution images, the roof of the Kennedy Center is not
homogeneous. In these images, the feature is not merged together into
a single region that matches the shape specified until fairly late in the
merging process. The tail on the feature in DC38618 is a piece of
sidewalk that was merged into part of the building before the feature
was merged together.

Figure 4 shows the results of processing on the feature Executive
Office Building in four different images. One of the images of the
feature is shown on the left with the segmentation result overlaid and
appearing as a dark outline. On the right are the outlines of the
predicted feature shapes, the extracted features, and the superpositions
of the predicted and extracted features, showing their relative positions
in the image. Note the recovery from a significant correspondence
error in one of the examples. The resolution for each image is given on
the far right.

Figure 5 shows the results of processing for the feature McMillan
Reservoir in four different images. One image of the feature is shown
on the left with its segmentation result overlaid and appearing as a
dark outline. In this image, part of the feature is not visible since the
feature is on the edge of the image and is clipped. When this happens,
the map-to-image correspondence of the database feature onto the
image results in a template feature clipped to the image bounds. The
resulting shape is approximately the same shape as that in the image to
be segmented. The accuracy of locating the partial feature is usually
the same as for location of the whole feature. On the right and bottom
of Figure 5 are the outlines of the predicted feature shapes for the other
three images along with the extracted features, and the superpositions
of the predicted and extracted features, showing their relative positions
in the image. The superposition of the predicted and extracted shape
is not shown for the example at the bottom due to space limitations.
The region at the bottom of the figure was also on the edge of the
image, except that in this case almost all of the feature was off the image.
The resolution for each image is given on the far right.

4.2.  Using  Generic  Descriptions 

In addition to the use of specific map feature templates,
MACHINESEG can be used to find regions having generic shape or
spectral properties. Figure 6 is a photograph of the terminal building
area at National Airport, Washington, D.C. We have been using
MACHINESEG to provide region candidates to a rule-based system for
photo-interpretation, SPAM [7]. Figures 7 and 8 show line drawings of
the regions extracted from Figure 6. The criteria for Figure 7 were
established to produce large blob regions, which might correspond to
tarmac, grassy areas between runways, or parking lots. A histogram of
initial seed region areas was used to select an area criterion based on
the distribution of large initial regions. Since we were searching for
blob regions, a compactness criterion which excluded compact regions
was selected.

The interaction between SPAM and MACHINESEG is an example of a
high-level process generating tasks for low-level image processing.
Since SPAM maintains models of its current view of the state of the
airport interpretation, it can predict image sub-areas within which
particular features may be found and shape criteria for those features.
For example, when a long linear region is produced by MACHINESEG,
several hypotheses (interpretations) may be produced, such as runway,
taxiway, access road, shadow region, etc. In order to verify these
hypotheses, SPAM may invoke MACHINESEG to attempt to extend a
linear region at either of its ends. Since the location of the feature
endpoints, width, and other shape attributes are known, criteria can be
specified which constrain region merging to an image sub-area looking
for new regions with similar properties.

Thus, Figure 8 shows the results of MACHINESEG segmentation using
linear and small area criteria. The number of linear regions produced
can be controlled by tightening the linearity criterion to make it more
selective. While this segmentation is not perfect, it does give a good set
of candidate regions for high-level processing. The length, width and
area ranges can be specified using ground distances (i.e., meters, feet)
and transformed automatically into pixel distances using map-to-image
correspondence. Current generic feature criteria include runways,
taxiways, access roads, parking lots, grassy areas, tarmac, hangars, and
terminal buildings.
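
The conversion from ground units to pixel units might be sketched as
follows. This is only an illustration: it assumes a single uniform scale
(meters per pixel) rather than the full map-to-image correspondence, and
the names ShapeCriteria and ground_to_pixel, as well as the numeric
ranges, are hypothetical and not taken from MACHINESEG.

```python
from dataclasses import dataclass


@dataclass
class ShapeCriteria:
    """Hypothetical container for length, width, and area ranges."""
    length: tuple  # (min, max)
    width: tuple   # (min, max)
    area: tuple    # (min, max)


def ground_to_pixel(criteria, meters_per_pixel):
    """Convert ranges given in meters (and square meters) to pixels.

    Assumes one uniform scale over the whole image, which is a
    simplification of a real map-to-image correspondence.
    """
    s = meters_per_pixel
    return ShapeCriteria(
        length=tuple(v / s for v in criteria.length),
        width=tuple(v / s for v in criteria.width),
        area=tuple(v / (s * s) for v in criteria.area),
    )


# Illustrative runway-like ranges in meters; the numbers are made up.
runway_m = ShapeCriteria(length=(1500.0, 4000.0), width=(30.0, 60.0),
                         area=(45000.0, 240000.0))
runway_px = ground_to_pixel(runway_m, meters_per_pixel=2.0)
print(runway_px)
```

Lengths and widths scale linearly with the ground sample distance, while
areas scale with its square.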

5.  Conclusion 

In this paper we describe MACHINESEG, a program that performs
map-guided image segmentation. The use of shape and spectral criteria
to control merging of regions within a region-growing paradigm is
discussed. Examples of the use of a feature description from a map
database to guide feature segmentation from an image database using
explicit map-to-image correspondence are presented. The use of
generic map-based descriptions of shape to find instances of classes of
objects is presented. This program has been integrated into the MAPS
system and uses the CONCEPTMAP database as a source of feature
descriptions. It is also used as a component of a rule-based system
(SPAM) for photo-interpretation.
Abstract


One of the best definitions of early vision is that it is inverse
optics: a set of computational problems that both machines and
biological organisms have to solve. While in classical optics the
problem is to determine the images of physical objects, vision
is confronted with the inverse problem of recovering three-dimensional
shape from the light distribution in the image. Most
processes of early vision such as stereomatching, computation of
motion and all the "structure from" processes can be regarded
as solutions to inverse problems. This common characteristic
of early vision can be formalized: most early vision problems
are "ill-posed problems" in the sense of Hadamard. We will
show that a mathematical theory developed for regularizing
ill-posed problems leads in a natural way to the solution of early
vision problems in terms of variational principles of a certain
class. This is a new theoretical framework for some of the
variational solutions already obtained in the analysis of early
vision processes. It also shows how several other problems in
early vision can be approached and solved.

Variational  Solutions  to  Vision  Problems 

In  recent  years,  the  computational  approach  to  vision  has  begun 
to  shed  some  light  on  several  specific  problems.  One  of  the 
recurring  themes  of  this  theoretical  analysis  is  the  identification 
of  physical  constraints  that  make  a  given  computational  problem 
determined and solvable. Some of the early and most successful
examples are the analyses of stereomatching (Marr and Poggio,
1976, 1979; Grimson, 1981a, b; Mayhew and Frisby, 1981; Kass,
1984; for a review see Nishihara and Poggio, 1984) and structure
from motion (Ullman, 1979). More recently, variational principles
have been used to introduce specific physical constraints. For
instance, visual surface interpolation can be derived from the
minimization of functionals that embed a generic constraint of
smoothness (Grimson, 1981b, 1982; Terzopoulos, 1983, 1984a).
Computation of visual motion can be successfully performed
by finding the smoothest velocity field consistent with the data
(Horn and Schunck, 1981; Hildreth, 1984a, b) and shape can be
recovered from shading information in terms of a variational
method (Ikeuchi and Horn, 1981). The computation of subjective
contours (Ullman, 1976; Brady et al., 1980; Horn, 1981), of
lightness (Horn, 1974) and of shape from contours (Barrow
and Tenenbaum, 1981; Brady and Yuille, 1984) can also
be formulated in terms of variational principles. Terzopoulos
(1984a, 1985) has recently reviewed the use of a certain class
of variational principles in vision problems within a rigorous
theoretical framework.


We  wish  to  show  that  these  variational  principles  follow  in  a 
natural  and  rigorous  way  from  the  ill-posed  nature  of  early 
vision  problems.  We  will  then  propose  a  general  framework  for 
"solving"  many  of  the  processes  of  early  vision. 


Ill-Posed  Problems 

In  1923,  Hadamard  defined  a  mathematical  problem  to  be 
well-posed  when  its  solution 
(a) exists,
(b) is unique, and
(c) depends continuously on the initial data (this means that the
solution is robust against noise).

Most of the problems of classical physics are well-posed, and
Hadamard argued that physical problems had to be well-posed.
"Inverse" problems, however, are usually ill-posed. Inverse
problems can usually be obtained from the direct problem by
exchanging the role of solution and data. Consider, for instance,

$y = Az$  (1)

where A is a known operator. The direct problem is to determine
y from z; the inverse problem is to obtain z when y ("the data")
are given. Though the direct problem is usually well-posed, the
inverse problem is usually ill-posed, when z and y belong to a
Hilbert space.[1]

Typical ill-posed problems are analytic continuation, backsolving
the heat equation, superresolution, computer tomography, image
restoration and the determination of the shape of a drum from
its frequency of vibration, a problem which was made famous
by Marc Kac (1966). In early vision, most problems are ill-posed
because the solution is not unique (but see later the case of
edge detection).[2]


Regularization  Methods 

Rigorous regularization theories for "solving" ill-posed problems
have been developed during the past years (see especially
Tikhonov, 1963; Tikhonov and Arsenin, 1977; and Nashed, 1974,
1976). The basic idea of regularization techniques is to restrict
the space of acceptable solutions by choosing the function
that minimizes an appropriate functional. The regularization of
the ill-posed problem of finding z from the data y such that
$Az = y$ requires the choice of norms $\|\cdot\|$ (usually quadratic)
and of a stabilizing functional $\|Pz\|$. The choice is dictated
by mathematical considerations, and, most importantly, by a
physical analysis of the generic constraints on the problem.
Three main methods can then be applied (see Bertero, 1982):
(I) Among z that satisfy $\|Pz\| \le C$, where C is a constant, find
z that minimizes

$\|Az - y\|$.  (2)

(II) Among z that satisfy $\|Az - y\| \le C$, find z that minimizes

$\|Pz\|$.  (3)

(III) Find z that minimizes

$\|Az - y\|^2 + \lambda \|Pz\|^2$,  (4)

where $\lambda$ is a regularization parameter.
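
In finite dimensions, method (III) reduces to a linear system. The
following is a minimal sketch, assuming A and P are matrices and the
norms are Euclidean; the function name tikhonov_solve and the toy data
are illustrative and not part of the original text.

```python
import numpy as np


def tikhonov_solve(A, y, P, lam):
    """Minimize ||A z - y||^2 + lam * ||P z||^2 via the normal equations."""
    lhs = A.T @ A + lam * (P.T @ P)
    rhs = A.T @ y
    return np.linalg.solve(lhs, rhs)


# A is deliberately non-injective: its two columns are identical, so
# A z = y does not determine z uniquely.
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])
y = np.array([2.0, 4.0])
P = np.eye(2)          # simplest stabilizer: penalize the size of z
z = tikhonov_solve(A, y, P, lam=1e-6)
print(z)               # close to [1, 1], the smallest solution of A z = y
```

Without the stabilizing term the matrix A^T A is singular here; the
term lam * P^T P is what makes the solution unique.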

The first method consists of finding the function z that satisfies
the constraint $\|Pz\| \le C$ and best approximates the data. The
second method computes the function z that is sufficiently close
to the data (C depends on the estimated errors and is zero
if the data are noiseless) and is most "regular". In the third
method, the regularization parameter $\lambda$ controls the compromise
between the degree of regularization of the solution and its
closeness to the data. Regularization theory provides techniques
to determine the best $\lambda$ (Tikhonov and Arsenin, 1977; Wahba,
1980). It also provides a large body of results about the form
of the stabilizing functional P that ensure uniqueness of the
result and convergence. For instance, it is possible to ensure
uniqueness in the case of Tikhonov's stabilizing functionals (also
called stabilizers of p-th order) defined by

$\|Pz\|^2 = \int \sum_{r=0}^{p} c_r(\xi) \left( \frac{d^r z}{d\xi^r} \right)^2 d\xi$  (5)

where $c_r(\xi)$ are positive weighting factors. Equation (5) can be
extended in the natural way to several dimensions. If one seeks
regularized solutions of eq. (1) with P given by eq. (5) in the
Sobolev space $W_2^p$ of functions that have square-integrable
derivatives up to p-th order, the solution can be shown to be
unique (up to the null space of P), if A is linear and continuous.
This is because for every p the space $W_2^p$ is a Hilbert space
and $\|Pz\|^2$ is a quadratic functional (see theorem 1, Tikhonov
and Arsenin, 1977, p. 63). It turns out that most stabilizing
functionals used so far in early vision are of the Tikhonov type
(see also Terzopoulos, 1984a, b).[3] They all correspond to either
interpolating or approximating splines (for method II and method
III, respectively).

Figure 1. Decomposition and ambiguity of the velocity field.
(a) The local velocity vector V(s) in the image plane is decomposed
according to eq. (6) into components perpendicular and tangent to
the curve. (b) Local measurements cannot measure the full velocity
field: the circle undergoes pure translation; the arrows represent the
perpendicular components of velocity that can be measured from the
images. From Hildreth, 1984a.
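
As a worked instance of the Tikhonov stabilizer of eq. (5), consider
the two simplest choices of weights. This is a sketch: setting the
lower-order weights to zero is an assumption made here for illustration
and is, strictly, a limiting case of positive weights.

```latex
% p = 1, c_0 = 0, c_1 = 1: a first-order smoothness measure,
% the form used for the velocity field in eq. (8)
\|Pz\|^2 = \int \left( \frac{dz}{d\xi} \right)^2 d\xi

% p = 2, c_0 = c_1 = 0, c_2 = 1: a second-order smoothness measure,
% the form used for numerical differentiation in eq. (10)
\|Pz\|^2 = \int \left( \frac{d^2 z}{d\xi^2} \right)^2 d\xi
```

In both cases the functional is quadratic in z, which is the property
the uniqueness argument relies on.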


Example  I:  Motion 

Our first claim is that variational principles introduced recently in
early vision for the problem of shape from shading, computation
of motion, and surface interpolation are exactly equivalent to
regularization techniques of the type we described. The associated
uniqueness results are directly provided by regularization
theory. We briefly discuss the case of motion computation
in its recent formulation by Hildreth (1984a, b).

Consider the problem of determining the two-dimensional velocity
field along a contour in the image. Local motion measurements
along contours provide only the component of velocity in the
direction perpendicular to the contour. Figure 1 shows how the
local velocity vector V(s) is decomposed into a perpendicular
and a tangential component to the curve:

$V(s) = v^{T}(s)\, T(s) + v^{\perp}(s)\, N(s)$  (6)

The component $v^{\perp}(s)$ and the direction vectors T(s) and N(s) are
given directly by the initial measurements, the "data". The
component $v^{T}(s)$ is not and must be recovered to compute
the full two-dimensional velocity field V(s). Thus the "inverse"
problem of recovering V(s) from the data $v^{\perp}$ is ill-posed because
the solution is not unique. Mathematically, this arises because
the operator K defined by

$v^{\perp} = KV$  (7)

is not injective. Equation (7) describes the imaging process as
applied to the physical velocity field V, which consists of the x
and y components of the velocity field on the image plane.

Intuitively, the set of measurements given by $v^{\perp}(s)$ over an
extended contour should provide considerable constraint on
the motion of the contour. An additional generic constraint,
however, is needed to determine this motion uniquely. For
instance, rigid motion on the plane is sufficient to determine V
uniquely but is very restrictive, since it does not cover the case
of motion of a rigid object in space. Hildreth suggested, following
Horn and Schunck (1981), that a more general constraint is
to find the smoothest velocity field among the set of possible
velocity fields consistent with the measurements. The choice
of the specific form of this constraint was guided by physical
considerations (the real world consists of solid objects with
smooth surfaces whose projected velocity field is usually smooth)
and by mathematical considerations, especially uniqueness
of the solution. Hildreth proposed two algorithms: in the case of
exact data the functional to be minimized is a measure of the
smoothness of the velocity field

$\|PV\|^2 = \int \left( \frac{\partial V}{\partial s} \right)^2 ds$  (8)

subject to the measurements $v^{\perp}(s)$. Since in general there will
be error in the measurements of $v^{\perp}$, the alternative method is
to find V that minimizes

$\|KV - v^{\perp}\|^2 + \lambda \int \left( \frac{\partial V}{\partial s} \right)^2 ds$.  (9)
It is immediately seen that these schemes correspond to the
second and third regularizing method respectively. Uniqueness
of the solutions (proved by Hildreth[4] for the case of equation
(8)) is a direct consequence for both equations (8) and (9) of
standard theorems of regularization theories. In addition, other
results can be used to characterize how the correct solution
converges depending on the smoothing parameter $\lambda$.
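
A discrete analogue of the third method, functional (9), can be sketched
as follows. This is an illustration under stated assumptions, not
Hildreth's implementation: the contour is closed and sampled at n
points, derivatives along the contour are replaced by differences
between neighbouring samples (the sample spacing is absorbed into the
regularization parameter), and the name smoothest_velocity_field is
hypothetical.

```python
import numpy as np


def smoothest_velocity_field(normals, v_perp, lam):
    """Discrete analogue of functional (9) on a closed contour.

    normals: (n, 2) unit normals N(s_i); v_perp: (n,) measured normal
    components; lam: regularization parameter. Returns (n, 2) velocities.
    """
    n = len(v_perp)
    idx = np.arange(n)
    # K maps the stacked unknowns [Vx0, Vy0, Vx1, Vy1, ...] to N_i . V_i.
    K = np.zeros((n, 2 * n))
    K[idx, 2 * idx] = normals[:, 0]
    K[idx, 2 * idx + 1] = normals[:, 1]
    # D takes differences V_{i+1} - V_i around the closed contour.
    shift = np.roll(np.eye(n), 1, axis=1)
    D = np.kron(shift - np.eye(n), np.eye(2))
    u = np.linalg.solve(K.T @ K + lam * (D.T @ D), K.T @ v_perp)
    return u.reshape(n, 2)


# Circle under pure translation, as in Figure 1(b).
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
normals = np.column_stack([np.cos(theta), np.sin(theta)])
true_v = np.array([1.0, 0.5])
v_perp = normals @ true_v
V = smoothest_velocity_field(normals, v_perp, lam=1.0)
print(np.abs(V - true_v).max())  # small: the translation is recovered
```

For a translating circle the constant field fits the data exactly and
has zero smoothness cost, so it is the unique minimizer; the normal
components alone would leave the tangential component undetermined.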


Example  II:  Edge  Detection 

We have recently applied regularization techniques to another
classical problem of early vision: edge detection. Edge detection,
intended as the process that attempts to detect and localize
changes of intensity in the image (this definition does not
encompass all the meanings of edge detection), is a problem of
numerical differentiation (Torre and Poggio, 1984). Notice that
differentiation is a common operation in early vision and is not
restricted to edge detection. The problem is ill-posed because
the solution does not depend continuously on the data.[5] The
intuitive reason for the ill-posed nature of the problem can be
seen by considering a function f(x) perturbed by a very small
(in $L_2$ norm) "noise" term $\epsilon \sin \Omega x$. f(x) and
$f(x) + \epsilon \sin \Omega x$ can be arbitrarily close, but their
derivatives may be very different if $\Omega$ is large enough.
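
The argument can be made explicit with a short calculation. The
symbols $\epsilon$ (amplitude) and $\Omega$ (frequency) denote the
perturbation described above; the numerical values are chosen here only
for illustration.

```latex
\frac{d}{dx}\bigl[ f(x) + \epsilon \sin \Omega x \bigr]
  = f'(x) + \epsilon\,\Omega \cos \Omega x

% The data change by at most \epsilon in magnitude, but the derivative
% changes by up to \epsilon\,\Omega. For example,
% \epsilon = 10^{-3}, \Omega = 10^{4}  gives  \epsilon\,\Omega = 10 .
```

However small the perturbation of the data, a sufficiently large
frequency makes the perturbation of the derivative arbitrarily large,
which is the failure of continuous dependence on the data.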

In 1-D, numerical differentiation can be regularized in the
following way. The "image" model is $y_i = f(x_i) + \epsilon_i$, where
$y_i$ is the data and $\epsilon_i$ represent errors in the measurements.
We want to estimate $f'$. We chose a regularizing functional
$\|Pf\|^2 = \int (f''(x))^2 dx$, where $f''$ is the second derivative of f. The
second regularizing method (no noise in the data) is equivalent
then to using interpolating cubic splines for differentiation.[6] The
third regularizing method, which is more natural since it takes
into account errors in the measurements, leads to the variational
problem of minimizing

$\sum_i (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 dx$.  (10)

Poggio et al. (1984) have shown (a) that the solution of this
problem f can be obtained by convolving the data $y_i$ (assumed
on a regular grid) with a convolution filter R, and (b) that the filter
R is a cubic spline[7] with a shape very close to a Gaussian and
a size controlled by the regularization parameter $\lambda$ (see figure
2). Differentiation can then be accomplished by convolution of
the data with the appropriate derivative of this filter. The optimal
value of $\lambda$ can be determined for instance by cross validation
and other techniques. This corresponds to finding the optimal
scale of the filter.[8]
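
The effect of the third method on numerical differentiation can be
illustrated with a fully discrete analogue of functional (10). This is
a sketch under assumptions that are not in the original: the integral
of the squared second derivative is replaced by a sum of squared second
differences on a unit-spaced grid, the minimizer is obtained by solving
a linear system rather than by convolving with the cubic spline filter,
and the test signal, noise level and parameter value are arbitrary.

```python
import numpy as np


def regularized_smooth(y, lam):
    """Minimize sum_i (y_i - f_i)^2 + lam * sum_i (second difference of f)_i^2."""
    n = len(y)
    # D2 is the (n-2) x n second-difference operator on a unit-spaced grid.
    D2 = np.zeros((n - 2, n))
    rows = np.arange(n - 2)
    D2[rows, rows] = 1.0
    D2[rows, rows + 1] = -2.0
    D2[rows, rows + 2] = 1.0
    return np.linalg.solve(np.eye(n) + lam * (D2.T @ D2), y)


h = 0.01
x = np.arange(0.0, 2.0 * np.pi, h)
rng = np.random.default_rng(0)
y = np.sin(x) + 0.01 * rng.standard_normal(x.size)

# Differentiate the raw data, then the regularized data.
naive = np.gradient(y, h)
smooth = np.gradient(regularized_smooth(y, 1e4), h)

err_naive = np.sqrt(np.mean((naive - np.cos(x)) ** 2))
err_smooth = np.sqrt(np.mean((smooth - np.cos(x)) ** 2))
print(err_naive, err_smooth)
```

Because the functional is quadratic, the solution depends linearly on
the data, which is why it can be expressed as a linear filter. Larger
values of the regularization parameter correspond to a wider effective
filter.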

These  results  can  be  directly  extended  to  two  dimensions  to  cover 
both  edge  detection  and  surface  interpolation  and  approximation. 
The  resulting  filters  are  very  similar  to  two  of  the  edge  detection 
filters  derived  and  extensively  used  in  recent  years  (Marr  and 
Hildreth  1980;  Canny,  1983;  see  Torre  and  Poggio,  1984). 

Other problems in early vision such as shape from shading
(Ikeuchi and Horn, 1981) and surface interpolation (Grimson,
1981b, 1982; Terzopoulos, 1983, 1984), in addition to the
computation of velocity, have already been formulated and
"solved" in similar ways using variational principles of the type
suggested by regularization techniques (although this was not
realized at the time). It is also clear that other problems such
as stereo[9] and structure from motion[10] can be approached in
terms of regularization analysis.


Physical  Plausibility  of  the  Solution 


Uniqueness of the solution of the regularized problem, which is
ensured by formulations such as equations (2)-(4), is not the only
(or even the most relevant) concern of regularization analysis.


Figure 2. The edge detection filter. (a) The convolution filter
obtained by regularizing the ill-posed problem of edge detection with
method (III) (see Poggio et al., 1984). It is a cubic spline (solid line),
very similar to a Gaussian (dotted line). (b) The first derivative of
the filter for different values of the regularizing parameter $\lambda$, which
effectively controls the scale of the filter. This one-dimensional profile
can be used for two-dimensional edge detection by filtering the image
with oriented filters with this transversal cross-section and choosing the
orientation with maximum response (see Canny, 1983).


Physical plausibility of the solution is the most important criterion.
The decision regarding the choice of the appropriate stabilizing
functional cannot be made judiciously from purely mathematical
considerations. A physical analysis of the problem and of its
generic constraints has the upper hand. Regularization theory
provides a framework within which one has to seek constraints
that are rooted in the physics of the visual world. This is,
of course, the challenge of regularization analysis. Conditions
characterizing the physically correct solutions can be derived
(for the case of motion, see Yuille, 1983 and for edge detection,
see Poggio et al., 1984).

From  a  more  biological  point  of  view,  a  careful  comparison  of 
the  various  "regularization"  solutions  with  human  perception 
promises  to  be  a  very  interesting  area  of  research,  as  suggested 
by  Hildreth's  work.  For  some  classes  of  motions  and  contours, 
the  solution  of  equations  (8)  and  (9)  is  not  the  physically  correct 
velocity  field.  In  these  cases,  however,  the  human  visual  system 
also  appears  to  derive  a  similar,  incorrect  velocity  field  (Hildreth, 
1984a, b). 


Conclusion 

The concept of ill-posed problems and the associated regularization
theories seem to provide a satisfactory theoretical framework
for part of early vision. This new perspective justifies the use of
variational principles of a certain type for solving specific problems,
and suggests how to approach other early vision problems.
It provides a link between the computational (ill-posed) nature
of the problems and the computational structure of the solution
(as a variational principle). In a companion paper (Poggio and
Koch, 1984), we will discuss computational "hardware" that is
natural for solving variational problems of the type implied by
regularization methods. The approach can be extended to other
sensory modalities and to some motor control problems. For instance,
a recently proposed solution to the problem of executing
a voluntary arm trajectory (Hogan, 1984) can be recognized as
an instance of our second regularization technique.[12]

Despite its attractions, this theoretical synthesis of early vision
also shows the limitations that are intrinsic to the variational
solutions proposed so far, and in any case to the simple forms
of the regularization approach. The basic problem is the degree
of smoothness required for the unknown function z that has to
be recovered. If z is very smooth, then it will be robust against
noise in the data, but it may be too smooth to be physically
plausible. For instance, in visual surface interpolation, the degree
of smoothness obtained with the thin plate model (from a specific
form of equations (4)-(5)) smoothes depth discontinuities too
much and often leads to unrealistic results (but see Terzopoulos,
1984).

These problems may be solved by more sophisticated regularization
techniques, such as stochastic methods. The simple
regularization techniques analyzed here rely on quadratic variational
principles that lead to linear Euler-Lagrange equations.
Thus the solution can be found by filtering the data through an
appropriate linear filter. Analog electrical or chemical networks
can be devised for the specific variational principles (Poggio
and Koch, 1984). Again, the universe of solutions to quadratic
variational principles is somewhat restricted.

Nonquadratic variational principles are, however, possible. They
may arise naturally in one of the most fundamental problems
in early vision, the problem of integrating different sources of
information, such as stereo, motion, shape from shading, etc.
This problem is ill-posed, not just because the solution is not
unique (the standard case), but because the solution is usually
overconstrained and may not exist. The use and extensions
of tools from regularization theory to analyze the fusion of
information from different sources is one of the most interesting
challenges in the theory of early vision.[13, 14]

The problem is related to the deep question of the computational
organization of a visual processor and its control structure. It
is unlikely that variational principles alone could have enough
flexibility to control and coordinate the different modules of
early vision and their interaction with higher level knowledge.
This also hints at the basic limitation of regularization methods
that makes them suitable only for the first stages of vision.
They derive numerical representations (surfaces) from numerical
representations (images). It is difficult to see how the computation
of the more symbolic type of representations that
are essential for a powerful vision processor can fit into this
theoretical framework.[15]

In summary, we have outlined a new theoretical framework
that leads from the computational nature of early vision problems to
algorithms for solving them, and suggests a specific class of
appropriate hardware. The common computational structure of
many early vision problems is that they are ill-posed in the sense
of Hadamard. Regularization analysis can be used to solve them
in terms of variational principles of a certain type that enforce
constraints derived from a physical analysis of the problem.
Analog networks, whether electrical or chemical, are a simple
and attractive way of solving this type of variational principle.
Footnotes


[1] Whether a problem is well- or ill-posed depends on the triplet
(A, Z, Y) where Z and Y are the solution and the data space
respectively.

[2] The reason for the lack of uniqueness is that the operator
corresponding to A is usually not injective, as in the case of
shape from shading, surface interpolation and computation of
motion.

To clarify some of the structure of ill-posed problems, let us
consider the linear operator equation

$Az = y$.  (1)

If z and y are finite vectors, then the inverse problem is easily
solved by finding the inverse of A, or its pseudoinverse. It is
well known that if A is a square matrix, $A^{-1}$ exists if $\det A \neq 0$.
Now let us suppose that $z \in Z$ and $y \in Y$, where Z and Y are
Hilbert spaces. The inverse problem is well-posed iff the three
conditions of Hadamard are satisfied. In particular,

(1) condition (a) of Hadamard is satisfied iff the range of A
is R(A) = Y.

(2) condition (b) of Hadamard is satisfied iff A is injective.

(3) condition (c) of Hadamard is satisfied iff R(A) is a closed
set.

If the operator A is compact and R(A) does not have finite
dimensions, R(A) is not closed, and therefore the inverse problem is
also ill-posed.

Most linear operators whose domain and codomain are Hilbert
spaces are compact operators. In fact, if E and F are measurable,
bounded sets, $E \subset R^n$ and $F \subset R^m$, and k(t, s) is a measurable
function defined on E x F, then the linear operator
$A: L_2(E) \to L_2(F)$ defined as

$(Az)(t) = \int k(t, s)\, z(s)\, ds$  (2)

is compact, and R(A) has finite dimensions iff k(t, s) is separable,
i.e.,

$k(t, s) = \sum_{k=1}^{N} \alpha_k(t)\, \beta_k(s)$.  (3)

Obviously if R(A) has finite dimension, then R(A) cannot coincide
with Y, and therefore the inverse problem of an integral operator
or a convolution is in general an ill-posed problem.

We can relax condition (2) and admit the case that A is not
injective. The problem is then regularized by introducing an
appropriate norm and finding the generalized pseudoinverse of
the inverse problem (1).
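
In the finite-dimensional case this can be illustrated directly. The
sketch below assumes the Euclidean norm as the "appropriate norm" and
uses NumPy's Moore-Penrose pseudoinverse; the matrix and data are toy
values chosen for illustration.

```python
import numpy as np

# A maps two unknowns to one measurement, so it is not injective:
# every z with z[0] + z[1] = 2 reproduces the data exactly.
A = np.array([[1.0, 1.0]])
y = np.array([2.0])

z = np.linalg.pinv(A) @ y
print(z)  # the minimum-norm solution, approximately [1, 1]
```

Among all solutions of A z = y, the pseudoinverse selects the one of
smallest norm, which restores uniqueness.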

When y is not in R(A), it is not easy to regularize the problem
without altering the essence of the problem itself.

[3] J. Canny's (1983) variational formulation can be derived from
eq. (4) and a stabilizing functional of the form of eq. (5) (see
Poggio et al., 1984).

[4] It is shown in Hildreth (1984a) that extremizing equation (8)
yields  a  unique  velocity  field,  since  it  corresponds  to  minimizing 
a  positive  definite  functional  on  a  convex  set.  The  theorems 
of  du  Bois  Reymond  state  that,  provided  ^  is  continuous  the 
solution  of  the  minimization  problem  will  be  the  solution  of  the 
corresponding Euler-Lagrange equations.

[5] The problem is to find the solution z to

$y = Az$

with $(Az)(x) = \int_0^x z(s)\, ds$. Thus, z is the derivative of the data
y. The problem is (mildly) ill-posed because if $z \in L_2[0, 1]$, the
range of the compact operator A is not closed in $L_2[0, 1]$.

[6] For data on a regular grid, it corresponds to convolving the
data with the /,, filters of Schoenberg (1946).

[7]  A  higher  degree  stabilizer  may  be  used  for  higher  derivatives, 
leading  to  higher  order  splines. 

[8] Methods such as the Generalized Cross Validation method
(GCV) (Wahba, 1980; see also Reinsch, 1967) may be used
to find the "optimal" scale of the filter, i.e., the optimal $\lambda$.
Fingerprints (Yuille and Poggio, 1983) may provide a method for
finding the optimal value of the regularization parameter $\lambda$. This
follows from the fact that the filter given by equation (10) is very
similar to a Gaussian and that $\lambda$ effectively controls the scale of
the filter (see Poggio et al., 1984).

[9] Another clearly ill-posed problem is stereo-matching. It is not
immediately obvious, however, what the correct regularizing procedure
is. Berthold Horn has suggested (personal communication)
a variational principle for stereo matching similar to his
scheme for computing optical flow. The norm to be minimized
measures deviations from smoothness of the disparity field.
Specifically, the norm of the derivative of the z component, the
depth component, has to be minimized subject to the constraints
given by the data. This can be regarded again as a variational
principle of the type that is obtained directly using the standard
regularization methods of ill-posed problems. We are presently
developing regularization solutions to the stereo problem (Yuille
and Poggio, in preparation).

The problem of shape from contours in the variational formulation
of Brady and Yuille (1983) is an ill-posed problem but the solution
is not of the standard regularization type.

[10]  The  rubbery  constraint  proposed  by  Ullman  (1983)  is  more 
general  than  the  rigidity  constraint.  It  may  be  possible  to 
reformulate  it  according  to  regularization  techniques. 

[11] A method for checking physical plausibility of a variational
principle is, of course, computer simulation. A simple technique
we suggest is to use the Euler-Lagrange equation associated
with the variational problem.

In the computation of motion, Yuille (1983) has obtained the
following sufficient and necessary condition for the solution of
the variational principle, equation (8), to be the correct physical
solution

$T \cdot \frac{\partial^2 V}{\partial s^2} = 0$


where  T  is  the  tangent  vector  to  the  contour  and  V  is  the  true 
velocity  field.  The  equation  is  satisfied  by  uniform  translation 
or  expansion  and  by  rotation  only  if  the  contour  is  polygonal. 
These  results  suggest  that  algorithms  based  on  the  smoothness 
principle  will  give  correct  results,  and  hence  be  useful  for 
computer  vision  systems,  when  (a)  motion  can  be  approximated 
locally  by  pure  translation,  rotation  or  expansion,  or  (b)  objects 
have  images  consisting  of  connected  straight  lines.  In  other 
situations,  the  smoothness  principle  will  not  yield  the  correct 
velocity  field,  but  may  yield  one  that  is  qualitatively  similar  and 
close  to  human  perception  (Hildreth,  1984a, b). 

In the case of edge detection (intended as numerical differentiation),
the solution is correct if and only if the intensity profile is
a polynomial spline of odd degree greater than three (Poggio et
al., 1984).

[12] The variational principle (minimization of jerk) corresponds
to the second regularization method, with $P = d^3/dx^3$. The
associated interpolating function is a quintic spline. Analog
networks for solving the problem can be devised (Poggio and
Koch, 1984). It may be interesting to consider our third method
of regularization in the context of the available data on arm
trajectories.
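
For a single movement between two rest positions, the minimum-jerk
profile has a well-known closed form, a quintic polynomial in time. The
sketch below assumes zero velocity and acceleration at both endpoints,
a boundary condition that is not stated in the footnote; the function
name minimum_jerk is illustrative.

```python
import numpy as np


def minimum_jerk(x0, xf, duration, t):
    """Rest-to-rest minimum-jerk position profile (a quintic in t)."""
    tau = np.asarray(t, dtype=float) / duration
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)


t = np.linspace(0.0, 1.0, 5)
print(minimum_jerk(0.0, 2.0, 1.0, t))
```

With several via points instead of two endpoints, the interpolant is
piecewise quintic, which is the quintic spline mentioned above.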

[13] The variational principles that we have considered so far for
early vision processes are quadratic and lead therefore to linear
equations. The ill-posed problem of combining several different
sources of surface information may easily lead to non-quadratic
regularization expressions (though different "non-interacting"
constraints can be combined in a convex way, see Terzopoulos,
1984). These minimization problems will in general have multiple
local minima. Schemes similar to annealing (Kirkpatrick, Gelatt
and Vecchi, 1983; Hinton and Sejnowski, 1983; Geman and
Geman, 1984) may be used to find the global minimum (see also
Poggio and Koch, 1984).

[14]  This  is  a  list  of  open  problems  on  which  we  are  presently 
working: 

a)  Regularized  solution  for  stereo  matching. 

b)  Regularized  solution  for  structure  from  motion. 

c) Full extension of the edge detection analysis to 2-D and
application to surface approximation for computing differential
properties of surfaces.

d) Analysis and implementation of methods for finding the optimal
regularization parameter λ. Use of fingerprints.

e) Connection between the regularizing parameter λ, the iteration
number in iterative regularizing methods (Nashed, 1976) and the
truncation of a formal power series expansion of the regularizing
operator.

f)  Use  of  stochastic  regularization  methods  (see  also  Geman  and 
Geman,  1984). 

[15] But see Hummel and Zucker, 1980.
Abstract

The transformation of the three-dimensional coordinates of
a point to the two-dimensional coordinates of its image can be
expressed compactly as a 4 x 4 homogeneous coordinate transformation
matrix in accordance with a particular imaging geometry.
The matrix can either be derived analytically from knowledge
about the camera and the geometry of image formation, or it can
be computed empirically from the coordinates of a small number
of three-dimensional points and their corresponding image
points. Despite the utility of the matrix in image understanding,
motion tracking, and autonomous navigation, very little is understood
about the inverse problem of recovering the projection parameters
from its coefficients. Previous attempts have produced
solutions that require iteration or the solution of a set of simultaneous
nonlinear equations. This paper shows how the location
and orientation of the camera, as well as the other parameters
of the image-formation process, can easily be computed from the
homogeneous coordinate transformation matrix. The problem is
formulated as a simple exercise in constructive geometry and the
solution is both noniterative and intuitively understandable.

1  Introduction 

Homogeneous coordinates and the homogeneous coordinate transformation
matrix are a convenient means for representing arbitrary
transformations, including perspective projection, in a single
formalism. One such use for this matrix is as a camera transformation
matrix that maps points in an object-centered coordinate
system into the corresponding points in image coordinates according
to a particular imaging geometry [7]. The camera transformation
matrix has seen wide use in several disciplines. Rogers
and Adams present numerous applications in computer graphics
[8]. Other fields that have made use of the camera transformation
matrix include stereo reconstruction, robot vision, photogrammetry,
unmanned-vehicle guidance, and image understanding [5], [6],
[12], [11]. Several techniques for computing the matrix have been
derived, yet very little is understood about how to recover the
projection parameters from the coefficients of the matrix.

When the location and orientation of the camera are known,
the camera transformation matrix that models the image formation
process can easily be derived analytically [1]. This model
forms the basis for subsequent processing of images produced by
that camera. On the other hand, when the location and orientation
of the camera are unknown, the parameters of image formation
must be derived from the correspondence between a set
of image features and a set of object features. Images obtained
from an unknown source or from cameras mounted on moving
platforms exemplify contexts in which the imaging geometry may
not be known.

One approach to computing the parameters of the image formation
process directly is embodied in RANSAC, developed by
Fischler and Bolles [3]. RANSAC computes the camera location
directly from a set of landmarks with known three-dimensional
locations when, in addition, the focal length and piercing point
are known.

Alternatively, several methods exist for estimating the coefficients
of the camera transformation matrix from the correspondence
between image and object coordinates. Sutherland [10] describes
a method to determine the matrix experimentally from
the image by using a least-squares technique to obtain the coefficients
from available ground truth data. A consideration of the
experimental errors involved and a means for improving accuracy
are described by Sobel [9].

The issue addressed in this paper is how to determine the
imaging geometry from a camera transformation matrix that has
been derived experimentally. For example, given a photograph
taken by an unknown camera from an unknown location and
which, moreover, may have been cropped and/or enlarged, how
can we recover the camera's position and orientation and determine
the extent to which the picture was cropped or enlarged? If
some ground truth data are available, an established least-squares
technique such as Sutherland's can be used to derive the camera
transformation matrix, whereupon the problem reduces to that of
computing the values of the desired parameters from the matrix.
Ganapathy [4] recently published the first noniterative method for
solving this problem by posing it as a set of eleven simultaneous
nonlinear equations that can be solved to obtain the eleven independent
coefficients of the camera transformation matrix. While
the method is successful at solving for camera location and orientation,
it is an algebraic one that provides little insight into the
underlying geometry. The method to be described here is a geometric
one that solves for the same parameters, but is posed as a
simple problem in constructive geometry and allows an intuitively
clear derivation.

This work has immediate application in several areas:

•  Many algorithms in image understanding require knowledge
of the camera parameters. These can be computed from an
arbitrary photograph by using the method presented here
when ground truth data are available.

•  Autonomous navigation can be posed as a problem in deriving
camera parameters. A cruise missile, for instance,
could obtain the camera transformation matrix from a terrain
model stored on board and then compute the camera
parameters that define the vehicle's location and heading.

•  A stationary camera viewing a robot arm workspace could
determine the position and orientation of the arm. Conspicuous
marking of several points on a part of the manipulator
would allow their easy extraction from an image and provide
the ground truth necessary for Sutherland's algorithm
to ascertain the camera transformation matrix. The camera
parameters can be derived from this matrix, and the
location and orientation of the manipulator can then be obtained
relative to the stationary camera.


2  The  Camera  Transformation  Matrix 

As indicated earlier, the camera transformation matrix can
be used to model in a single formalism the effects of rotation,
translation, perspective, scaling, and cropping: all the variables
associated with the normal imaging process. Here we review
the fundamentals of homogeneous coordinate transformations
that are essential for understanding the decomposition to
be described. The presence of an ideal lens and the absence of
any atmospheric distortions are assumed.

The imaging situation can be modeled as shown in Figure
1. The XYZ coordinate system represents the world or object-centered
coordinates. The center of projection (the location of
the lens) is shown as a point L in space. The image plane is a
plane between the lens and the object onto which the object is
projected to obtain the image. Each image point is that point in
the image plane where the plane intersects the line connecting L
with the corresponding object point. The UVW coordinate system
is situated such that (u, v) are the image coordinates of an image
point and w = 0 defines the image plane. The perpendicular
distance between L and the image plane is the camera focal length, f.

In a homogeneous coordinate system, a three-dimensional point
(x, y, z) is represented as a four-component row vector, (tx, ty, tz, t);
the three-dimensional coordinates are obtained by dividing through
by the fourth component. A point in the world is represented as
a four-component row vector and its projection in the image is
obtained by postmultiplying by the 4 x 4 camera transformation
matrix:

$$\mathbf{x}M = \mathbf{u}$$

$$(x, y, z, 1)\,M = (su, sv, sw, s)$$

This homogeneous coordinate system is most useful for modeling
the effects of perspective projection; further details can be found
in Ballard and Brown [1].

The matrix M can be viewed as being composed of several
simple transformations. While it is possible to decompose the
matrix in a variety of ways, the particular decomposition chosen
must capture all the degrees of freedom of the imaging geometry.
The somewhat arbitrary choice used throughout this paper
is shown below:

$$M = (\text{translate})(\text{rotate})(\text{project})(\text{scale})(\text{crop})$$

$$M = TRPSC \qquad (1)$$


FIGURE 1 THE IMAGING GEOMETRY


Each of the component transformations can be expressed as a
4 x 4 matrix; multiplying them together produces the camera
transformation matrix M. Details of the decomposition are given
below.


2.1  Translation 

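Translation moves the image plane away from the object-centered
origin. To translate the plane by $(x_0, y_0, z_0)$, multiply by the matrix

$$T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -x_0 & -y_0 & -z_0 & 1 \end{bmatrix}.$$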


2.2  Rotation 


The orientation of the camera is specified by the rotation matrix
R, which can be further decomposed to $R = R_x R_y R_z$, corresponding
to rotation about each of the principal axes. Clockwise
rotation by $\theta$ about the X axis while looking toward the origin is
accomplished by

$$R_x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

Similarly, clockwise rotation by $\phi$ about the newly rotated Y axis
is represented by

$$R_y = \begin{bmatrix} \cos\phi & 0 & \sin\phi & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\phi & 0 & \cos\phi & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$

and rotation by $\psi$ about the new Z axis is given by

$$R_z = \begin{bmatrix} \cos\psi & -\sin\psi & 0 & 0 \\ \sin\psi & \cos\psi & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

The first two rotations, $R_x$ and $R_y$, serve to align the Z axis with
the line of sight defined by the W axis. The final rotation, $R_z$,
is within the image plane about the line of sight. Together, T
and $R = R_x R_y R_z$ account for the location and orientation of the
camera.


FIGURE 2 PERSPECTIVE PROJECTION AFTER ROTATION AND TRANSLATION

FIGURE 3 DETERMINING THE LOCATION OF THE CAMERA


2.3  Projection 

P provides the distortion associated with perspective projection.
Figure 2 shows the simplified imaging geometry after translation
and rotation have been accounted for. The camera location
L is on the positive Z axis at a distance f from the origin. The
image plane passes through the origin and the world to be imaged
lies behind the image plane. Analysis of similar triangles shows
that the image coordinates of a world point (x, y, z) are

$$u = \frac{fx}{f - z}, \qquad v = \frac{fy}{f - z}.$$

Using homogeneous coordinates, this perspective projection can
be obtained by multiplying the homogeneous coordinates of the
world point by the matrix

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1/f \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

The resulting row vector is divided through by the fourth component
to renormalize the homogeneous coordinates, and then
projected orthographically onto the image plane w = 0 to yield
the proper perspective projection of the world point.
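As a quick numerical check of the projection step, the following Python sketch postmultiplies a homogeneous row vector by a matrix of the form P and renormalizes. The helper names (`projection_matrix`, `transform`, `orthoproject`), the focal length f = 2, and the sample world point are illustrative assumptions, not values from the paper.

```python
def projection_matrix(f):
    """Perspective matrix for a lens on the +Z axis at distance f (row-vector convention)."""
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, -1.0 / f],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform(x, M):
    """Postmultiply row vector x by M, then divide through by the fourth component."""
    h = [sum(x[i] * M[i][j] for i in range(4)) for j in range(4)]
    return [c / h[3] for c in h]

def orthoproject(u):
    """Discard the w coordinate so the point lands on the image plane w = 0."""
    return [u[0], u[1], 0.0, 1.0]

f = 2.0                                   # assumed focal length
world_point = [1.0, 1.0, -2.0, 1.0]       # assumed point behind the image plane (z < 0)
image_point = orthoproject(transform(world_point, projection_matrix(f)))
print(image_point)
```

The result can be compared with the similar-triangles relation: for this point, f x / (f - z) = 2 / 4, so both image coordinates should equal one half.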


2.4  Scaling 

The image coordinates can be scaled to reflect an enlargement
or shrinking of the image. Scaling by $k_u$ and $k_v$ in the U and V
directions is achieved with

$$S = \begin{bmatrix} k_u & 0 & 0 & 0 \\ 0 & k_v & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

Scaling the W axis is meaningless because the perspective projection
always requires that w = 0.


2.5  Cropping 

The effect of cropping a photograph is obtained by translating
the UV coordinates within the image plane. The following matrix
is used to shift the origin by $(u_0, v_0)$:

$$C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -u_0 & -v_0 & 0 & 1 \end{bmatrix}.$$

Note that neither scaling nor cropping affects the w and s coordinates
of the homogeneous image point, so that orthographic
projection and renormalization can take place after the entire
transformation has been computed.
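The decomposition M = TRPSC can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sign of the translation row is assumed to mirror the cropping matrix, and the parameter values are invented for the example.

```python
import math

def matmul(A, B):
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def translate(x0, y0, z0):          # assumed sign convention, mirroring the crop matrix
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [-x0, -y0, -z0, 1]]

def rot_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rot_y(p):
    c, s = math.cos(p), math.sin(p)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]

def rot_z(q):
    c, s = math.cos(q), math.sin(q)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def project(f):
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, -1.0 / f], [0, 0, 0, 1]]

def scale(ku, kv):
    return [[ku, 0, 0, 0], [0, kv, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def crop(u0, v0):
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [-u0, -v0, 0, 1]]

def camera_matrix(origin, angles, f, scales, offset):
    """Compose M = T Rx Ry Rz P S C, multiplying left to right."""
    M = translate(*origin)
    for step in (rot_x(angles[0]), rot_y(angles[1]), rot_z(angles[2]),
                 project(f), scale(*scales), crop(*offset)):
        M = matmul(M, step)
    return M

def vecmat(x, M):
    """Row vector times matrix, without renormalizing."""
    return [sum(x[i] * M[i][j] for i in range(4)) for j in range(4)]

# Invented example: image-plane origin at (10, 20, 30), rotation about Z only.
M = camera_matrix(origin=(10.0, 20.0, 30.0), angles=(0.0, 0.0, 0.5),
                  f=50.0, scales=(2.0, 3.0), offset=(5.0, 7.0))
lens = [10.0, 20.0, 80.0, 1.0]      # f units along Z from the image-plane origin
print(vecmat(lens, M)[3])           # fourth component vanishes at the center of projection
```

With only a rotation about Z, the lens sits f units along the Z axis from the translated origin, so its homogeneous image has a zero fourth component; that is the property Section 3.2 exploits.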


3  Recovering  the  Camera  Parameters 

The camera transformation matrix M allows representation of
all eleven degrees of freedom associated with the image formation
process. These camera parameters are embedded in the matrix
in a way that makes their determination difficult. This section
presents a simple method that recovers the various parameters
associated with image formation. Its main advantages are that it
is both noniterative and geometric, enabling a clear understanding
of the equations involved.

The matrix M can be viewed as a function that maps world
coordinates into image coordinates according to the constraints
of Figure 1. For notational simplicity, we shall assume that all
matrix multiplications automatically normalize the homogeneous
coordinates of the resulting row vector. For example,
$\mathbf{u} = \mathbf{x}M = (su, sv, sw, s) = (u, v, w, 1)$. The image formation process can
then be written as

$$(u, v, 0, 1) = \mathrm{orthoproject}(\mathbf{x}M), \qquad (2)$$

where $\mathbf{x}$ is the homogeneous coordinate of a world point, and
orthoproject(·) is a function that performs an orthographic projection
along the w axis such that

$$\mathrm{orthoproject}(u, v, w, 1) = (u, v, 0, 1).$$

3.1  Location  of  the  Camera 

Figure 3 illustrates the technique for finding the center of projection.
First, compute $M^{-1}$ for later use. Note that M will
always be invertible because all its components in Equation 1 are
clearly invertible. The location of the center of projection, L, can
be determined as follows:

Choose an arbitrary world point $\mathbf{x}_1 = (x_1, y_1, z_1, 1)$ and compute
$\mathbf{u}_1 = \mathbf{x}_1 M$. If we were to multiply $\mathbf{u}_1 = (u_1, v_1, w_1, 1)$ by
$M^{-1}$, we would obtain the original $\mathbf{x}_1$. Instead, first project $\mathbf{u}_1$
to obtain $\mathrm{orthoproject}(\mathbf{u}_1) = (u_1, v_1, 0, 1)$, where $(u_1, v_1)$ are the
image coordinates of $\mathbf{x}_1$. Next, backproject this image point by
multiplying by $M^{-1}$ to obtain $\mathbf{x}_{11}$. This specifies another world
point, $\mathbf{x}_{11}$, which is different from the original $\mathbf{x}_1$ but constrained
to lie along the line connecting $\mathbf{x}_1$ and the center of projection L.

FIGURE 4 DETERMINING THE ORIENTATION OF THE CAMERA

FIGURE 5 DETERMINING THE PIERCING POINT
To confirm this, note that all points lying along the line connecting
$\mathbf{x}_1$ and L are transformed by M to points identified by
$(u_1, v_1, w, 1)$, where w varies with each point. The converse must
also be true. That is, for any w, $(u_1, v_1, w, 1)M^{-1}$ specifies a point
somewhere along the line connecting $\mathbf{x}_1$ and L.

Repeat the above process with another point $\mathbf{x}_2$ to obtain the
point $\mathbf{x}_{21}$, which must lie on the line connecting $\mathbf{x}_2$ and L. Now $\mathbf{x}_1$
and $\mathbf{x}_{11}$, and $\mathbf{x}_2$ and $\mathbf{x}_{21}$, define two lines that pass through L; their
intersection can be computed to obtain the world coordinates of
L. This method will fail, of course, if either $\mathbf{x}_1$ or $\mathbf{x}_2$ lies in the
image plane or if $\mathbf{x}_1$, $\mathbf{x}_2$, and L are collinear. Because their choice is
arbitrary, valid points can always be found that allow the unique
determination of L.
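The back-projection construction can be tried on the matrix from Appendix A.1. The sketch below is an illustration, not the author's implementation: the helper names and the two sample world points are assumptions, and because two numerically computed lines rarely meet exactly, the "intersection" is taken as the midpoint of their closest approach.

```python
# Appendix A.1 matrix in 4x4 row-vector form.
M = [
    [-2.3819, -0.043897, 0.0, -0.00026388],
    [0.49648, -0.062872, 0.0, -0.00062759],
    [-0.039462, -2.4071, 1.0, -0.000071843],
    [847.40, 882.91, 0.0, 1.0],
]

def transform(x, A):
    """Row vector times matrix, renormalized by the fourth component."""
    h = [sum(x[i] * A[i][j] for i in range(4)) for j in range(4)]
    return [c / h[3] for c in h]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + [float(i == j) for j in range(n)] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                k = aug[r][c]
                aug[r] = [vr - k * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def backproject(x, A, A_inv):
    """Image the world point x, flatten it to w = 0, and map it back to the world."""
    u = transform(x, A)
    return transform([u[0], u[1], 0.0, 1.0], A_inv)[:3]

def nearest_point(a1, b1, a2, b2):
    """Midpoint of the shortest segment joining line a1-b1 and line a2-b2."""
    dot = lambda s, t: sum(i * j for i, j in zip(s, t))
    d1 = [q - p for p, q in zip(a1, b1)]
    d2 = [q - p for p, q in zip(a2, b2)]
    w = [p - q for p, q in zip(a1, a2)]
    a, b, c, d, e = dot(d1, d1), dot(d1, d2), dot(d2, d2), dot(d1, w), dot(d2, w)
    den = a * c - b * b                   # zero only for parallel lines
    t1, t2 = (b * e - c * d) / den, (a * e - b * d) / den
    q1 = [p + t1 * k for p, k in zip(a1, d1)]
    q2 = [p + t2 * k for p, k in zip(a2, d2)]
    return [(i + j) / 2 for i, j in zip(q1, q2)]

M_inv = inverse(M)
x1 = [0.0, 0.0, 100.0, 1.0]               # assumed sample points, chosen with w != 0
x2 = [500.0, 0.0, 200.0, 1.0]
L = nearest_point(x1[:3], backproject(x1, M, M_inv), x2[:3], backproject(x2, M, M_inv))
print([round(c, 2) for c in L])
```

The printed location can be compared with the value of L reported in Appendix A.1. The sample points deliberately have nonzero z, because for this matrix the w coordinate equals the world z coordinate and a point with w = 0 would back-project onto itself.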

3.2  Orientation  of  the  Camera 

The orientation of the camera is defined by the orientation of
the image plane (Figure 4). The latter can easily be established
by observing that world points lying in the plane that is parallel
to the image plane and that passes through the center of the lens
will map to infinity in image coordinates. The only way this can
happen for a finite world point is if the fourth component of the
homogeneous image coordinate is zero.

Thus, if

$$(x, y, z, 1)\,M = (u, v, w, 0),$$

it follows that

$$M_{14}x + M_{24}y + M_{34}z + M_{44} = 0,$$

which is the equation of the plane through L parallel to the image
plane. From this equation it is clear that the vector
$\mathbf{n} = (M_{14}, M_{24}, M_{34})$ is normal to the image plane and parallel to the
camera's direction of view.
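A short Python sketch shows how the normal is read off the fourth column. It uses the Appendix A.1 matrix and the camera location reported there; the function names are illustrative assumptions.

```python
import math

# Appendix A.1 matrix in 4x4 row-vector form.
M = [
    [-2.3819, -0.043897, 0.0, -0.00026388],
    [0.49648, -0.062872, 0.0, -0.00062759],
    [-0.039462, -2.4071, 1.0, -0.000071843],
    [847.40, 882.91, 0.0, 1.0],
]

def view_normal(A):
    """Unit vector along (M14, M24, M34), the normal to the image plane."""
    n = [A[0][3], A[1][3], A[2][3]]
    length = math.sqrt(sum(c * c for c in n))
    return [c / length for c in n]

def plane_value(A, point):
    """Left-hand side of M14 x + M24 y + M34 z + M44 = 0 for a world point."""
    x, y, z = point
    return A[0][3] * x + A[1][3] * y + A[2][3] * z + A[3][3]

n = view_normal(M)
print([round(c, 4) for c in n])
# The camera location reported in Appendix A.1 should nearly satisfy the plane equation.
print(plane_value(M, (620.51, 1295.68, 321.64)))
```

The normalized vector can be compared with the normal listed in Appendix A.1, and the plane residual at the reported camera location should be close to zero.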

The  orientation  in  terms  of  rotations  about  the  axes  can  be 
calculated  by  using  spherical  coordinates  such  that 


1  *  14 

6  -  arrsin  ■  ■  ■  — 

y/A/f,  +  +  M;t 

where $\theta$ is the clockwise rotation about the X axis and $\phi$ is the
clockwise rotation about the rotated Y axis. The final rotation
parameter, $\psi$, is the rotation within the image plane about the W
axis. The magnitude of $\psi$ cannot be obtained from the normal to
the image plane; instead, it requires a more complex derivation
that involves determination of the piercing point and the relative
scale factors. These values are derived in the following sections
and the value of $\psi$ is finally computed in Section 3.5.

3.3  Piercing  Point,  Principal  Ray,  and  Cropping 

Much work in image understanding requires knowledge of the
piercing point (or stare point) in an image. This is the point in an
image that corresponds to the world point at which the camera
was aimed. It is the point at which the principal ray (the ray along
which the camera was aimed) pierces the image plane (Figure 5).
The principal ray is assumed to be perpendicular to the image
plane.¹

To find the piercing point, first find a point $\mathbf{p}$ along the principal
ray (other than L):

$$\mathbf{p} = L + k\mathbf{n},$$

where k is any scalar except 0. The piercing point $\mathbf{u}_0$ is given by

$$\mathbf{u}_0 = \mathrm{orthoproject}(\mathbf{p}M) = (u_0, v_0, 0, 1)$$

because any point along the principal ray must project to the
piercing point in the image. The extent to which the image has
been cropped is given by $(u_0, v_0)$.
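The piercing-point construction can be checked against the Appendix A.1 data. In this sketch the camera location L is simply taken from the values reported in the appendix, and the function name and the step k = 1000 are assumptions made for illustration.

```python
import math

# Appendix A.1 matrix in 4x4 row-vector form.
M = [
    [-2.3819, -0.043897, 0.0, -0.00026388],
    [0.49648, -0.062872, 0.0, -0.00062759],
    [-0.039462, -2.4071, 1.0, -0.000071843],
    [847.40, 882.91, 0.0, 1.0],
]
L = (620.51, 1295.68, 321.64)        # camera location as reported in Appendix A.1

def piercing_point(A, center, k=1000.0):
    """Image of p = L + k n, where n is the unit normal from the fourth column of A."""
    n = [A[0][3], A[1][3], A[2][3]]
    length = math.sqrt(sum(c * c for c in n))
    p = [c + k * ni / length for c, ni in zip(center, n)] + [1.0]
    h = [sum(p[i] * A[i][j] for i in range(4)) for j in range(4)]
    return h[0] / h[3], h[1] / h[3]   # renormalize, then keep (u, v)

u0, v0 = piercing_point(M, L)
print(round(u0), round(v0))
```

Any nonzero k should give nearly the same image point; a large k merely reduces the influence of rounding in the two-decimal value of L.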

3.4  Focal  Length  and  Scale 

When the center of projection is held a constant distance from
the scene, there is no way to tell the difference between scaling
the image and varying the focal length. For example, doubling
the focal length is equivalent to enlarging the picture by a factor
of two. The best one can hope for is a relation between the two
parameters. On the other hand, if the focal length of the camera
is known, the exact scale factors can be determined.

¹The image plane in some cameras used in photogrammetry is not
perpendicular to the line of sight; this case, however, is not considered here.

FIGURE 6 DETERMINING THE SCALE FACTOR

FIGURE 7 DETERMINING THE ROTATION WITHIN THE IMAGE PLANE
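The relation between focal length and scale can be explored numerically on the Appendix A.1 matrix. The sketch below is a variation under stated assumptions rather than the paper's exact procedure: instead of projecting an arbitrary ray into the plane v = 0, it back-projects an image point displaced from the piercing point along one image axis only, so that the angle between that ray and the principal ray already lies in the required plane. The product of focal length and scale is then the displacement divided by the tangent of that angle. All helper names and the displacement of 100 pixels are hypothetical.

```python
import math

M = [
    [-2.3819, -0.043897, 0.0, -0.00026388],
    [0.49648, -0.062872, 0.0, -0.00062759],
    [-0.039462, -2.4071, 1.0, -0.000071843],
    [847.40, 882.91, 0.0, 1.0],
]

def transform(x, A):
    h = [sum(x[i] * A[i][j] for i in range(4)) for j in range(4)]
    return [c / h[3] for c in h]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + [float(i == j) for j in range(n)] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                k = aug[r][c]
                aug[r] = [vr - k * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def ray_direction(u, v, A_inv):
    """Direction of the world ray imaged at (u, v), from two back-projected depths."""
    near = transform([u, v, 0.0, 1.0], A_inv)[:3]
    far = transform([u, v, 1.0, 1.0], A_inv)[:3]
    return [b - a for a, b in zip(near, far)]

def tan_angle(d, n):
    """Tangent of the (acute) angle between directions d and n."""
    cross = [d[1] * n[2] - d[2] * n[1], d[2] * n[0] - d[0] * n[2], d[0] * n[1] - d[1] * n[0]]
    return math.sqrt(sum(c * c for c in cross)) / abs(sum(a * b for a, b in zip(d, n)))

M_inv = inverse(M)
n = [M[0][3], M[1][3], M[2][3]]                  # principal-ray direction
# Piercing point as the image of the point at infinity along n (limit of p = L + k n).
h = [sum(n[i] * M[i][j] for i in range(3)) for j in range(4)]
u0, v0 = h[0] / h[3], h[1] / h[3]

delta = 100.0                                    # assumed displacement in pixels
f_ku = delta / tan_angle(ray_direction(u0 + delta, v0, M_inv), n)
f_kv = delta / tan_angle(ray_direction(u0, v0 + delta, M_inv), n)
print(round(f_ku), round(f_kv), round(f_ku / f_kv, 4))
```

The two products and their ratio can be compared with the focal length and relative scale factor listed in Appendix A.1. Only the products and the ratio are recoverable; separating the focal length from either scale factor still requires one of them to be known.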

Figure 6 shows the geometry for computing the U-component
of the scale factor. First choose an arbitrary world point $\mathbf{x}_1$
(not on the principal ray) and compute its image point
$\mathbf{u}_1 = \mathrm{orthoproject}(\mathbf{x}_1 M) = (u_1, v_1, 0, 1)$. Conversion of these image
units to world units requires dividing by the scale factor such
that

$$u_1' = \frac{u_1 - u_0}{k_u} \qquad \text{and} \qquad v_1' = \frac{v_1 - v_0}{k_v},$$

where $u_1'$ and $v_1'$ are the distances of an image point from the
image origin, measured in world units. Next compute $\alpha$, the angle
between the principal ray and the ray from L to $\mathbf{x}_1$, projected
in the plane v = 0. Then it is clear from the diagram that the
following relation must hold:

$$\tan\alpha = \frac{u_1'}{f} = \frac{u_1 - u_0}{k_u f},$$

where $k_u$ is the magnification of the image in the U direction. If
the focal length is known, the scale factor is

$$k_u = \frac{u_1 - u_0}{f \tan\alpha},$$

or if the scale factor is known,

$$f = \frac{u_1 - u_0}{k_u \tan\alpha}.$$

The computation of $k_v$, the V-component of the scale factor, is
identical. Neither $k_u$ nor $k_v$ can be determined individually without
knowledge of the focal length, but their ratio can be calculated
from quantities derived from the matrix:

$$\frac{k_u}{k_v} = \frac{u_1 - u_0}{f \tan\alpha_u} \bigg/ \frac{v_1 - v_0}{f \tan\alpha_v} = \frac{(u_1 - u_0)\tan\alpha_v}{(v_1 - v_0)\tan\alpha_u}.$$

3.5 Rotation within the Image Plane

We now return to the derivation of $\psi$, the rotation of the camera
about the W axis. This rotation is equivalent to cropping the
image at an angle to the UV coordinate system. The value of $\psi$ is
found by choosing a world point and comparing its transformation
under two different situations, as illustrated in Figure 7.

First, use the coordinates of L computed earlier and an arbitrary
focal length to reconstruct the translation matrix T. Use
the values of $\theta$ and $\phi$ to reconstruct $R_x$ and $R_y$ and use the chosen
focal length to construct a perspective projection matrix $P'$. This
comprises a model $M_1 = T R_x R_y P'$, which can be employed to
compute the transformation of an arbitrary world point. Call the
resulting point $\mathbf{p}_1$.

Next, use the previously determined piercing point to reconstruct
the matrix C. Then undo the effects of cropping from the
camera model by multiplying the original camera transformation
matrix by $C^{-1}$ to obtain

$$M' = MC^{-1} = TRPSCC^{-1} = TRPS.$$

Now the effects of unbalanced scaling will be eliminated. Use the
relative scale factor computed earlier to construct a scale transformation
matrix:

$$S' = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & k_u/k_v & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

Then multiply $M'$ by $S'$ to obtain

$$M_2 = M'S' = TRPSS' = TRPS'',$$

where

$$S'' = SS' = \begin{bmatrix} k_u & 0 & 0 & 0 \\ 0 & k_u & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

Finally, use $M_2$ to compute the transformation of the previously
chosen world point and call the result $\mathbf{p}_2$.

The angle $\psi$ can now be determined by making use of Equation
1 and the following observation. The only differences between $M_1$
and $M_2$ are their focal lengths, a scale factor, and a rotation
about the W axis. Although the scale factor is unknown, it is
equal in the U and V directions because this was compensated
for in computing $M_2$. Together the scale factor and focal length
differences serve only to change the size of the image and impose
no other distortions. Observe that $\mathbf{p}_1$ is the image point that
would be obtained if there were no rotation about the W axis,
no scaling of the image, and no cropping of the image. Similarly,
$\mathbf{p}_2$ is the image point that is obtained by starting with the true
image point associated with the chosen object point and undoing
the effects of cropping and unbalanced scaling. Any difference
between $\mathbf{p}_1$ and $\mathbf{p}_2$ must be the result of different-sized images
or of rotation about the W axis. Since this rotation is centered
about the origin of this coordinate space, the angle $\psi$ can be
determined by measuring the angle between $\mathbf{p}_1$ and $\mathbf{p}_2$ at the
origin. The differing focal lengths and scale factors can affect
only the distance of the points from the origin and cannot alter
the angle between the points when measured from the origin.
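The final measurement, the angle between two image points as seen from the origin, is easy to sketch. The function name and the sample points below are assumptions for illustration; the function returns the counterclockwise angle from the first point to the second, so it would need negating to match a clockwise convention.

```python
import math

def rotation_angle(p1, p2):
    """Signed angle in degrees, counterclockwise from p1 to p2, measured at the origin."""
    cross = p1[0] * p2[1] - p1[1] * p2[0]
    dot = p1[0] * p2[0] + p1[1] * p2[1]
    return math.degrees(math.atan2(cross, dot))

# Invented points: the second is the first rotated a quarter turn and enlarged.
angle = rotation_angle((2.0, 0.0), (0.0, 5.0))
print(angle)
```

Because both the cross and dot products scale with the lengths of the two vectors, their ratio and hence the angle is unaffected by a uniform change of image size, which is why the unknown focal length and scale factor do not matter at this step.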

4  Discussion 

The method presented here provides a straightforward way of
determining the parameters of the imaging process from a homogeneous
coordinate transformation matrix. The geometric interpretation
provides some insight into what the equations mean and
when they may fail. The appendix illustrates application of the
technique to several sets of real data.

In practice, we must be concerned with the robustness of such
an algorithm and how it is affected by errors in the data. For
example, if f or k is very large, the view angle subtended by
the image is small and the projection is nearly orthographic. In this
case, the method becomes sensitive to the precision of the matrix,
and only the camera's orientation can be ascertained with
confidence. This property is intrinsic in the problem formulation,
and any method that derives camera parameters from the correspondence
between image and world coordinates is subject to this
sensitivity. The parameters computed by the methods outlined in
this paper can be used to reconstruct the camera transformation
matrix (within a choice of focal length) when synthetic data are
used. When empirical data are used, as in the appendix, instabilities
in the matrix often make it impossible to reconstruct it with
accuracy.

The camera model used throughout this paper is somewhat
simplified. The image plane has been assumed to be perpendicular
to the principal ray and the image axes are assumed to be perpendicular.
Furthermore, the effects of a non-ideal lens and other
nonisotropic distortions have been ignored. The accuracy of the
decomposition will degrade if these assumptions are not valid.
While we do not expect this technique to be any more robust
than that of Ganapathy, we do feel that its geometric interpretation
provides useful clues as to when it will be dependable. The
method's utility has been demonstrated on actual photographs,
and because it is noniterative, the computational burden is
insignificant.

A Examples

We now present two examples to illustrate our technique.

A.1 Imagery from Robotics Applications

Ganapathy used the following experimentally determined 3 x
4 matrix to demonstrate his method [4].

$$\begin{bmatrix} -2.3819 & 0.49648 & -.039462 & 847.40 \\ -.043897 & -.062872 & -2.4071 & 882.91 \\ -.00026388 & -.00062759 & -.000071843 & 1.0 \end{bmatrix}$$

This matrix is used to obtain the image coordinates (us, vs, s)
by premultiplying the world coordinates, (xt, yt, zt, t), by the matrix.
To make it compatible with the notation used throughout
this paper, it must be transposed and an arbitrary column vector
inserted. This column is the one that determines w and does not
affect the imaging process. The matrix, suitably rewritten, is

$$M_1 = \begin{bmatrix} -2.3819 & -.043897 & 0.0 & -.00026388 \\ 0.49648 & -.062872 & 0.0 & -.00062759 \\ -.039462 & -2.4071 & 1.0 & -.000071843 \\ 847.40 & 882.91 & 0.0 & 1.0 \end{bmatrix}$$
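The rewriting step (transpose, then insert an arbitrary column for w) can be sketched as follows. The function name is hypothetical, and the inserted column (0, 0, 1, 0) is the one that appears in the rewritten matrix above.

```python
# Ganapathy's 3x4 matrix, used with column vectors: (us, vs, s)^T = G (x, y, z, 1)^T.
G = [
    [-2.3819, 0.49648, -0.039462, 847.40],
    [-0.043897, -0.062872, -2.4071, 882.91],
    [-0.00026388, -0.00062759, -0.000071843, 1.0],
]

def to_row_vector_form(G, w_column=(0.0, 0.0, 1.0, 0.0)):
    """Transpose G and insert w_column as the third column of a 4x4 matrix."""
    return [[G[0][i], G[1][i], w_column[i], G[2][i]] for i in range(4)]

M = to_row_vector_form(G)
for row in M:
    print(row)
```

Each row of G becomes a column of M: the first two rows give the u and v columns, and the third row becomes the fourth column, which carries the homogeneous scale s.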


From this matrix the following values were obtained by the
method presented in Section 3.

$$L = (620.51,\ 1295.68,\ 321.64)$$

This agrees closely with Ganapathy's determination of the camera's
location:

$$(620.9344,\ 1295.476,\ 321.8140)$$

The orientation of the camera is computed from the normal to
the image plane:

$$\mathbf{n} = (-.3855,\ -.9167,\ -.1049)$$

This yields (in degrees)

$$\theta = 157.1949 \qquad \phi = -6.0239 \qquad \psi = 359.909$$

in agreement with Ganapathy's results:

$$\theta = 157.1951 \qquad \phi = -6.023912 \qquad \psi = 359.6915$$

Other parameters obtained from $M_1$ include

piercing point: $(u_0, v_0) = (682, 478)$ in pixels
focal length: $f \cdot k_u = 3488$
              $f \cdot k_v = 3485$
relative scale factor: $k_u / k_v = 1.0009$

FIGURE 8 PHOTOGRAPH OF SAN FRANCISCO


A.2 Outdoor Imagery

Figure 8 shows a photograph taken from a book of pictures
of San Francisco [2]. It was necessary to determine the imaging
geometry in order to use the picture for work in image understanding.
Ground truth data were obtained manually from a map of
the city. A total of fifteen pairs of image and world coordinates
were used to obtain the following camera transformation matrix
with a least-squares program:

$$M_2 = \begin{bmatrix} .172137 & .131132 & 0.0 & .000346452 \\ -.15879 & .112747 & 0.0 & .000311253 \\ .0187902 & .291494 & 1.27976 & .0000656643 \\ 274.943 & 258.686 & 0.0 & 1.0 \end{bmatrix}$$

The results computed from this image are plotted on the map
in Figure 9 and described further below. The camera location
was computed to be near the intersection of California and Mason
Streets at an elevation of 435 feet above sea level. The camera
was oriented as shown on the map, at an angle of 8° above the
horizon. Computation of the piercing point is sensitive to errors in
the matrix because the projection is nearly orthographic, but the
location derived is marked by the point P in the image in Figure
8. The focal length and scale factor relations were computed to
be $f \cdot k_u = 495$ and $f \cdot k_v = 560$, indicating an aspect ratio of
$k_u / k_v = .88$.

Figures  10  and  11  show  the  results  for  another  photograph 
of  San  Francisco.  The  Gamera  transformation  matrix  computed 
from  16  points  of  ground  truth  data  is  shown  here: 


FIGURE  9  MAP  OF  SAN  FRANCISCO 


The  elevation  of  the  camera  was  found  to  be  1200  feet  above 
sea level and the inclination was 4° above the horizon. The horizontal
location and orientation are plotted in Figure 11. The
piercing point was unreliably computed to be at the point P
in Figure 10. The focal length and scale factor relations were
$f \cdot k_u = 876$ and $f \cdot k_v = 999$, implying an aspect ratio of
$k_u/k_v = .88$.
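As a quick arithmetic check of the two aspect ratios quoted above, note that the focal length cancels when the two products are divided. The short Python sketch below is illustrative only; the function name is hypothetical and not part of the original work.

```python
def aspect_ratio(f_ku, f_kv):
    """Return k_u / k_v given the products f*k_u and f*k_v (f cancels)."""
    return f_ku / f_kv

first = aspect_ratio(495.0, 560.0)    # first photograph
second = aspect_ratio(876.0, 999.0)   # second photograph
print(round(first, 2), round(second, 2))  # prints: 0.88 0.88
```

Both photographs give the same ratio to two decimal places, as reported in the text.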


$$
M_2 = \begin{bmatrix}
-.175451 & .0269801 & 0.0 & .000151628 \\
-.105205 & -.0963531 & 0.0 & -.00016085 \\
.0043556 & .23031 & 1.07834 & .0000159749 \\
297.836 & 249.574 & 0.0 & 1.0
\end{bmatrix}
$$

FIGURE 10 ANOTHER PHOTOGRAPH OF SAN FRANCISCO


FIGURE  11  MAP  OF  SAN  FRANCISCO 

Optical Navigation by the Method of Differences


Bruce D. Lucas and Takeo Kanade
Computer Science Department
Carnegie-Mellon University
Pittsburgh,  PA  15213 


Abstract.  The  method  of  differences  refers 
to  a  technique  for  image  matching  that  uses  the 
intensity  gradient  of  the  image  to  iteratively  im¬ 
prove  the  match  between  the  two  images.  Used  in 
an  iterative  scheme  combined  with  image  smooth¬ 
ing,  the  method  exhibits  good  accuracy  and  a 
wide  convergence  range.  In  this  paper  we  show 
how the technique can be used to directly solve for
the parameters relating two cameras viewing the
same scene. The resulting algorithm can be used
for optical navigation, which has applications in
robot arm guidance and autonomous roving vehicle
navigation. Because of the regular structure
of the algorithm, the prospects of carrying it
out with special-purpose hardware for real-time
control of a robot seem good. We present experimental
results demonstrating the accuracy and
range of convergence that can be expected from
the  algorithm. 

1. Introduction

Optical  navigation  refers  to  the  detection  of 
the  position  and  orientation  of  a  camera,  and  thus 
of  the  object  that  it  is  attached  to,  by  analysis  of 
the  picture  taken  by  the  camera.  In  its  abstract 
form,  the  objective  of  such  analysis  is  to  deter¬ 
mine  some  or  all  of  the  six  parameters  (three  of 
position  and  three  of  orientation)  that  determine 
the  position  of  that  camera  relative  to  some  fixed 
frame  of  reference.  In  our  method  and  in  many 
others  the  fixed  frame  of  reference  is  that  of  a  sec¬ 
ond  camera,  so  that  the  problem  is  that  of  image 
comparison.  Optical  navigation  has  a  number  of 
applications  including  navigation  of  autonomous 
roving  vehicles  and  navigation  of  a  robot  arm  rel¬ 
ative  to  the  object  on  which  it  is  performing  its 
task.  We  will  discuss  these  applications  in  more 
detail  below. 

Several approaches to optical navigation are
possible.  These  may  be  divided  into  three  cat¬ 
egories:  sparse  two-dimensional  matching,  con¬ 
tinuous  two-dimensional  matching,  and  three- 
dimensional  matching.  In  the  sparse  two- 
dimensional  approach,  one  starts  with  a  discrete 
set  of  points  in  the  two  images  that  match,  and 
deduces  the  camera  motion.  In  the  case  of  a 
sparse  set  of  points,  there  is  a  question  of  how 
many  points  are  necessary  to  uniquely  solve  for 
the  camera  parameters;  this  has  been  addressed 
by  Tsai  &  Huang  (1981).  With  more  points,  the 
problem is overspecified and a least-squares approach
is required (Gennery, 1980). In the continuous
two-dimensional matching approach, one
starts with a whole image field of matches (the
“optical flow field”); Bruss & Horn (1983) have
shown how to determine the camera motion
from the optical flow field, again using a least-squares
formulation. Obtaining the optical flow
field has been investigated by several people, for
example Horn & Schunck (1981) and Cornelius &
Kanade  (1983).  In  the  three-dimensional  match¬ 
ing  approach,  corresponding  points  in  three  di¬ 
mensions  (obtained  e.g.  by  stereo)  are  used  to 
determine  the  camera  motion;  this  technique  was 
used  by  Moravec  (1980)  to  navigate  a  rover. 

These  approaches  all  split  the  process  into 
two  steps:  finding  the  matches  and  using  those 
matches  to  solve  for  the  camera  parameters.  In 
this  paper  we  show  how  to  combine  the  two  steps 
into  one,  by  applying  a  generalized  image  match¬ 
ing  technique  that  we  term  the  method  of  differ¬ 
ences.  The  method  of  differences  directly  com¬ 
putes  the  six  camera  parameters,  much  as  stan¬ 
dard  matching  techniques  compute  two  parame¬ 
ters  (the  x  and  y  displacements).  That  is,  the 
camera  parameters  are  explicitly  included  in  the 
matching  process.  The  method  takes  advantage 
of the fact that in many applications the approximate
position and orientation of the camera are
known. Starting from that estimate we compute
a better estimate by using the image intensity
gradient as a guide. By using an iterative scheme
our estimates converge to the correct value. The
result  is  a  technique  that  is  fast  and  free  of  search. 

The method of differences and similar methods
based on image gradients have been applied
before to the analysis of small image motions
(Limb & Murphy, 1975; Cafforio & Rocca,
1979), to optical flow field determination (Horn
& Schunck, 1981), and to stereo image analysis
(Lucas & Kanade, 1981). Here we show how it
can  also  be  applied  to  navigation. 

In  the  remainder  of  the  paper,  we  first 
present  several  robotics  scenarios  calling  for  op¬ 
tical  navigation.  Then  we  describe  the  method 
of  differences  in  a  one-dimensional  case,  which 
serves  to  illustrate  many  of  the  issues.  Then  we 
show  how  the  same  technique  can  be  used  for 
multi-parameter estimation. Finally, we present
some experimental results and draw some conclusions.

2.  Applications 

Many  robotic  tasks  require  a  knowledge  of 
the position and orientation of the “robot.” This
is  because  mechanical  imperfections  and  environ¬ 
mental  uncertainty  make  it  impossible  to  know 
exactly  how  a  robot  will  move  in  response  to  the 
commands  sent  to  it  and  exactly  what  it  will  en¬ 
counter  in  its  surroundings.  Optical  navigation 
takes its place alongside various types of mechanical
navigation to correct this problem, and indeed
has distinct advantages with respect to responding
to environmental uncertainty over those
methods.  Such  tasks  can  be  classified  along  sev¬ 
eral  dimensions;  three  of  particular  interest  to  us 
are  the  nature  of  the  robotic  agent,  the  nature 
of  the  environment,  and  the  manner  in  which  the 
optical  navigation  is  used. 

Robotic  agents.  For  our  purposes,  robotic 
agents  can  be  roughly  divided  into  two  classes: 
fixed-base  robot  arms  and  autonomous  roving  ve¬ 
hicles  (although  many  vehicle  designs  call  for  the 
rover  to  have  a  robotic  arm  or  manipulator  of 
some  sort).  In  both  cases  optical  navigation  can 
play  an  important  role,  although  the  nature  of 
the  navigation  may  be  different. 

For example, the “world” in which a manipulator
moves is generally smaller than that
in  which  a  rover  moves,  and  so  the  nature  and 
quality  of  the  image  obtained  may  differ  (con¬ 
sider  such  effects  as  depth  of  field).  Moreover, 
the  two  types  of  agent  differ  with  respect  to  the 
availability  of  other  sources  of  information  about 
the  robot's  movement.  A  rover  might  well  be  ca¬ 
pable  of  inertial  or  radio  navigation  in  addition 
to  optical  navigation.  On  the  other  hand  robot 
arms  typically  operate  in  environments  where 
such techniques as structured lighting may feasibly
provide additional information, while such is
generally not the case for rovers. Such additional
sources of knowledge could be incorporated into
the  method,  for  example  to  provide  the  necessary 
initial  estimate  of  position. 

A second point of difference is that the mechanisms
for control of an arm and a rover are quite
different: an arm is generally controlled by the coordinated
rotations of a number of joints, while
a rover is moved under the power of a number
of wheels or legs. Thus, the method must
be  adapted  differently  in  each  case  if  it  is  desired 
to  directly  solve  for  the  joint  positions,  the  wheel 
rotations,  or  even  the  control  signals  that  cause 
those  movements. 

Another  point  of  difference  is  in  the  degree  of 
constraint that exists in the motions of the robot.
Typically, a manipulator will be able to move in
all six degrees of freedom, while a rover may only
have freedom in, say, the pan, x, and z motions.
These constraints bear on the specific formulations
of the technique for the different tasks.

Environment.  In  a  known  environment,  the 
robot must perform a series of maneuvers with respect
to a set of objects in the environment whose
nature and approximate position are known in
advance. In an unknown environment, the
robot must be prepared to encounter anything,
or at least a variety of objects. Tasks in which
a robot arm must operate in an unknown environment
are conceivable, but unlikely; therefore
we  will  confine  ourselves  to  the  rover  case,  but 
our  remarks  concerning  the  rover  operating  in  a 
known  environment  will  apply  equally  well  to  the 
robot  arm  case. 

In  the  case  of  a  known  environment,  the  sce¬ 
nario  for  the  use  of  the  method  of  differences  is 
as  follows:  the  robot  is  taken  through  the  series 
of  operations  that  it  will  be  expected  to  perform. 
As  it  does  this  it  records  a  number  of  camera 
views  sufficient  to  cover  the  substantially  differ¬ 
ent  situations  the  robot  will  encounter.  Those 
images  will  serve  as  landmarks  with  respect  to 
which  the  robot  will  later  navigate.  Then  a  num¬ 
ber  of  reference  points  are  chosen  from  each  im¬ 
age  and  are  assigned  a  distance  from  the  cam¬ 
era.  This  can  be  done  a  number  of  ways:  a  sec¬ 
ond  camera  in  a  known  position  with  respect  to 
the  first  can  provide  a  stereo  baseline  for  making 
the  measurements;  a  known  model  of  the  envi¬ 
ronment  can  be  fitted  to  those  points;  the  dis¬ 


tances  can  be  directly  measured;  the  distances 
can  be  obtained  by  structured  lighting;  and  so 
on.  In  any  case,  since  this  is  a  training  step  to  be 
done  only  once,  the  assignment  of  distances  need 
not  be  completely  automated.  This  concludes  the 
training  process.  Then,  at  each  step  of  an  ac¬ 
tual  run,  the  image  received  by  the  camera  will 
be  compared  against  one  of  the  stored  reference 
images (actually, only the positions and intensities
of the reference points need be stored),
and  the  method  of  differences  will  be  used  to  de¬ 
termine  the  position  of  the  robot  relative  to  the 
reference  coordinate  system.  In  a  variation  of  this 
process,  the  method  of  differences  can  be  used  to 
directly  solve  for  the  control  signals  to  be  fed  to 
the  robot. 
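
One way to picture what has to be kept for each landmark after training is sketched below. This is a minimal illustration rather than the authors' data format: the class and field names are hypothetical, and the back-projection assumes an ideal pinhole camera in which an image point is the focal length times (q_x/q_z, q_y/q_z), a convention the text does not spell out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferencePoint:
    """One stored reference point from a training image (hypothetical layout)."""
    px: float         # image position p = [p_x, p_y] in the reference image
    py: float
    intensity: float  # reference image intensity at p
    depth: float      # distance z(p) assigned during training

    def to_three_space(self, focal_length=1.0):
        """Back-project p to q = [q_x, q_y, q_z] with q_z = z(p) (pinhole assumption)."""
        scale = self.depth / focal_length
        return (self.px * scale, self.py * scale, self.depth)

point = ReferencePoint(px=0.5, py=-0.25, intensity=120.0, depth=4.0)
print(point.to_three_space())  # prints: (2.0, -1.0, 4.0)
```

Only a list of such records per reference image would need to be stored, which matches the remark that the full reference image is not required.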

The  case  of  an  unknown  environment  is  more 
difficult. It is not possible to store reference images,
so the process of selecting and determining
the  distances  of  reference  points  must  be  carried 
out  anew  each  time  the  rover  encounters  a  new 
situation.  Techniques  for  doing  this,  such  as  by 
automatic  stereo  vision,  are  themselves  the  sub¬ 
ject  of  research.  Once  the  reference  points  are  lo¬ 
cated  and  their  distances  determined,  they  can  be 
used to navigate until they are lost from view. If
necessary,  the  stereo  solver  can  be  called  again  at 
each  step  to  refine  the  estimates  of  the  distances 
and  to  determine  the  distances  of  new  reference 
points  acquired  to  replace  old  ones  that  have  dis¬ 
appeared  from  view. 

Manner  of  application.  Finally,  optical 
navigation  can  be  used  in  a  robot  in  one  of  two 
ways.  If  the  position  of  the  robot  can  be  calcu¬ 
lated  quickly  enough  (for  example  with  the  aid 
of  special-purpose  hardware),  then  the  result  can 
be used in a continuous feedback loop. Two aspects
of the method are favorable for this mode.
In a feedback loop the range of convergence of
the algorithm need not be large, because the position
will be calculated frequently enough that
the robot will have moved relatively little during
that time. Moreover, the method need not accurately
compute the position of the robot, but
rather  needs  to  be  taught  what  the  camera  should 
see when it is in the desired position: as long as
the method is able to provide signals to move in
the proper direction when the camera is out of
position and provided that the method can detect
when the camera is seeing what it is expected to
see, the position of the camera will converge to
the  correct  position.  Of  course  proper  precau¬ 
tions  are  necessary  to  prevent  oscillation. 

Even  if  the  method  cannot  be  applied  fast 
enough  to  be  used  in  a  feedback  mode,  it  can 
still  be  fruitfully  used  in  a  “stop-and-go”  mode. 
In  such  a  case  the  rover  will  move  greater  dis¬ 
tances  between  each  application  of  the  method, 
and  so  the  demands  on  the  optical  navigation  are 
greater:  it  must  exhibit  in  this  case  a  greater 
range  over  which  it  will  converge  to  the  proper 
value. 


3. The technique

Parameter estimation by the method of
differences. First we present the technique for
the one-dimensional, one-parameter case to give
its flavor. Consider two one-dimensional images
$I_1(x)$ and $I_2(x)$ related by a translation, so that
$I_1(x) = I_2(x + h)$; we wish to estimate the translation
$h$. One way to do this is to find that $h$ that
minimizes the total squared error,

$$E = \sum_x \left( I_2(x + h) - I_1(x) \right)^2.$$

Since we want a local, non-searching algorithm,
we approximate $I_2(x + h)$ using $I_2(x)$ on the basis
of local information, namely the derivative; this
yields the approximation

$$E \approx \sum_x \left( I_2(x) + h D_x I_2(x) - I_1(x) \right)^2, \tag{1}$$

where $D_x$ denotes partial differentiation with respect
to $x$. This equation is quadratic in $h$, so we
can differentiate with respect to $h$, set equal to
zero, and solve the resulting linear equation for
$h$, obtaining

$$\hat{h} = \frac{\sum_x \left( I_1(x) - I_2(x) \right) D_x I_2(x)}{\sum_x \left( D_x I_2(x) \right)^2}. \tag{2}$$

This one-parameter case illustrates the nature of
the method. We call it the method of differences
because it is based on comparing the difference
between the images, $I_1(x) - I_2(x)$, with
the derivative $D_x I_2(x)$ (which will in fact be implemented
as a difference), to obtain an estimate
for the parameter $h$.

Iteration and smoothing. As it stands, this
method would not work very well. However, two
additional modifications make it a viable technique.
First, because the method yields only an
approximation $\hat{h}$ to the disparity $h$, we must use
an iterative scheme to obtain an accurate result.
The idea is to calculate an estimated disparity,
move $I_2$ by that amount, and calculate again.
Starting with an initial estimate $h_0$, the iteration
is given by

$$h_{i+1} = h_i + \frac{\sum_x \left( I_1(x) - I_2(x + h_i) \right) D_x I_2(x + h_i)}{\sum_x \left( D_x I_2(x + h_i) \right)^2}. \tag{3}$$


Note that if $h_i = h$, then our calculated increment
is zero (since $I_1(x) = I_2(x + h)$), so $h$ is indeed a
convergence value for the algorithm.
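
The one-parameter iteration can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names are hypothetical, the second image is assumed to be available as a callable that can be evaluated at shifted positions (in practice an interpolator over the sampled image), and the derivative is taken as a central difference.

```python
import numpy as np

def estimate_shift(i1, i2_at, x, h0=0.0, iterations=10):
    """Iterate the update rule to estimate h in I1(x) = I2(x + h).

    i1    -- samples of I1 at the positions x
    i2_at -- hypothetical callable returning I2 at arbitrary positions
    """
    dx = x[1] - x[0]
    h = h0
    for _ in range(iterations):
        moved = i2_at(x + h)  # I2 moved by the current estimate
        # Derivative of the moved image, implemented as a central difference.
        deriv = (i2_at(x + h + dx) - i2_at(x + h - dx)) / (2.0 * dx)
        h += np.sum((i1 - moved) * deriv) / np.sum(deriv ** 2)
    return h

# Synthetic example: a smooth periodic "image" and a known translation.
def image(t):
    return np.sin(t) + 0.5 * np.sin(2.0 * t + 0.3)

x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
true_h = 0.4
h_est = estimate_shift(image(x + true_h), image, x)
print(h_est)
```

On this smooth synthetic signal the estimate settles on the true translation within a few iterations; real images would normally need the smoothing discussed next.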

Second,  to  improve  the  accuracy  and  range  of 
validity  of  the  linear  estimate  used  in  (1),  we  must 
smooth  the  image.  This  can  be  thought  of  as 
smoothing out purely local bumps and wrinkles in
the  image  intensity  profile  that  would  make  a  lin¬ 
ear  estimate  accurate  only  over  a  small  range.  Al¬ 
ternatively,  smoothing  can  be  understood  in  the 
frequency  domain  as  reducing  the  high  frequency 
components.  We  have  shown  (Lucas,  1984)  that 
$\hat{h}$ as calculated in (2) can be expressed as

$$\hat{h} = \frac{\sum_{k>0} k \, r_k^2 \sin kh}{\sum_{k>0} k^2 r_k^2}, \tag{4}$$

where the $r_k$ are the coefficients of the Fourier
series for $I_2$:

$$I_2(x) = \sum_{k>0} r_k \sin(kx + \theta_k).$$


This  is  a  very  interesting  result,  because  it  says 
that the estimated disparity $\hat{h}$ depends only on
the power spectrum $r_k^2$ of the image, which tends
to be similar from image to image, and is independent
of the phase spectrum, which is where all the
information in the image is (Kretzmer, 1952; Oppenheim
& Lim, 1981). In particular, the power
spectrum $r_k^2$ tends to fall off as $k$ goes to infinity.
Thus  the  behavior  of  the  disparity  calculation  in 
(2)  will  be  similar  from  image  to  image,  and  so 
experiments  can  yield  generally  valid  results. 
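
As a sketch of why an expression of the form (4) follows from (2), suppose (as an idealization) that the sums in (2) are replaced by integrals over one period of a $2\pi$-periodic image with Fourier series $I_2(x) = \sum_{k>0} r_k \sin(kx + \theta_k)$, and that $I_1(x) = I_2(x + h)$. Orthogonality of the sinusoids then removes all cross terms between different $k$, and the phases $\theta_k$ drop out:

```latex
\begin{align*}
D_x I_2(x) &= \sum_{k>0} k\, r_k \cos(kx + \theta_k), \\
\int_0^{2\pi} I_2(x)\, D_x I_2(x)\, dx &= 0, \\
\int_0^{2\pi} I_1(x)\, D_x I_2(x)\, dx
  &= \sum_{k>0} k\, r_k^2 \int_0^{2\pi} \sin(kx + kh + \theta_k)\cos(kx + \theta_k)\, dx
   = \pi \sum_{k>0} k\, r_k^2 \sin kh, \\
\int_0^{2\pi} \bigl( D_x I_2(x) \bigr)^2 dx &= \pi \sum_{k>0} k^2 r_k^2, \\
\hat{h} &= \frac{\sum_{k>0} k\, r_k^2 \sin kh}{\sum_{k>0} k^2 r_k^2}.
\end{align*}
```

Only the squared amplitudes $r_k^2$ survive, which is the sense in which the estimate depends on the power spectrum and not on the phase spectrum.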

Indeed, inspection of (4) shows that the estimated
disparity $\hat{h}$, as a function of the real disparity
$h$, is zero when $h$ is zero (as already
noted), increases as $h$ increases (with the proper
sign), and finally falls back to zero as $h$ increases
further. (Examples of such curves are presented
later in Figure 3.) Furthermore, the more high-frequency
terms are present, the smaller are the
values of disparity $h$ for which the disparity estimate
$\hat{h}$ falls off to zero. For example, if $I_1(x)$
were a pure sine wave, then we could not in principle
calculate the disparity if $h$ is greater than
one-half wavelength. Both lines of reasoning thus
lead to the conclusion that smoothing will extend
the range of disparities $h$ over which the iteration
will  converge.  While  this  analysis  is  not  exact 
(for  example,  (4)  is  strictly  true  only  if  the  sum 
in  (2)  extends  over  the  whole  real  line  or  over 
one  period  of  a  periodic  function),  experiments 
we  present  below  will  verify  the  applicability  of 
the  conclusion. 
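
The qualitative behavior described above can be explored numerically. The sketch below assumes the estimate takes the form $\hat{h} = \sum_k k r_k^2 \sin kh / \sum_k k^2 r_k^2$ for a $2\pi$-periodic image with Fourier amplitudes $r_k$ (the form that follows from (2) when the sums run over one full period); the function name and the example amplitudes are hypothetical.

```python
import numpy as np

def predicted_estimate(h, amplitudes):
    """Predicted one-step estimate for true disparity h; amplitudes[k-1] = r_k."""
    r = np.asarray(amplitudes, dtype=float)
    k = np.arange(1, len(r) + 1)
    return np.sum(k * r**2 * np.sin(k * h)) / np.sum(k**2 * r**2)

# Pure sine wave (wavelength 2*pi): the estimate keeps the proper sign up to
# half a wavelength and returns to zero there.
print(predicted_estimate(2.5, [1.0]) > 0.0)           # prints: True
print(abs(predicted_estimate(np.pi, [1.0])) < 1e-12)  # prints: True

# Adding a higher-frequency term makes the estimate change sign sooner.
print(predicted_estimate(2.5, [1.0, 1.0]) < 0.0)      # prints: True
```

Removing the higher-frequency term, which is what smoothing does, restores the proper sign at the same disparity.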

Thus,  larger  smoothing  windows  yield  wider 
disparity  ranges  over  which  the  algorithm  con¬ 
verges,  but  may  not  produce  an  accurate  answer. 
The  relationship  between  iteration  and  smooth¬ 
ing  is  that  successive  steps  of  the  iteration  can 
be done with progressively less smoothed images,
in  a  sort  of  coarse-fine  approach.  This  allows  the 
algorithm  to  tolerate  a  large  disparity  yet  yield 
an  accurate  answer. 
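
A one-dimensional sketch of this coarse-fine idea is given below. It is an illustration under simplifying assumptions, not the authors' code: the signals are periodic and synthetic, shifting is done by linear interpolation with `numpy.interp`, smoothing is a uniform window over neighboring samples, and all names are hypothetical.

```python
import numpy as np

N = 256
PERIOD = 2.0 * np.pi
DX = PERIOD / N
X = np.arange(N) * DX

def box_smooth(samples, width):
    """Uniform smoothing of a periodic signal over `width` samples."""
    return np.mean([np.roll(samples, k) for k in range(width)], axis=0)

def shifted(samples, h):
    """Resample a periodic signal at x + h by linear interpolation."""
    return np.interp(X + h, X, samples, period=PERIOD)

def refine(i1, i2, h, iterations):
    """A few steps of the one-parameter iteration on sampled periodic images."""
    for _ in range(iterations):
        moved = shifted(i2, h)
        deriv = (np.roll(moved, -1) - np.roll(moved, 1)) / (2.0 * DX)
        h += np.sum((i1 - moved) * deriv) / np.sum(deriv ** 2)
    return h

def coarse_to_fine(i1, i2, h0, widths, iterations=5):
    """Run the iteration with progressively smaller smoothing windows."""
    h = h0
    for width in widths:
        h = refine(box_smooth(i1, width), box_smooth(i2, width), h, iterations)
    return h

# Synthetic image with a strong high-frequency component and a large disparity.
def image(t):
    return np.sin(t) + 0.5 * np.sin(8.0 * t)

true_h = 1.0
i2 = image(X)
i1 = image(X + true_h)
plain = refine(i1, i2, 0.0, 20)                       # no smoothing
staged = coarse_to_fine(i1, i2, 0.0, widths=[32, 1])  # coarse, then fine
print(round(plain, 2), round(staged, 2))
```

In this synthetic case the unsmoothed iteration settles on a wrong value, while the staged version first removes the high-frequency term with the wide window and then refines on the unsmoothed signal.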

Multi-parameter case. The multi-parameter
case is analogous. Consider the camera
model illustrated in Figure 1. We start with
an estimate of the vector $c$ of six camera parameters,
which define the relationship between the
two camera coordinate systems. We wish to estimate
the effect that a change $\Delta c$ in the camera
parameters $c$ will have on the error, given by

$$E = \sum_p \left( I_2(v) - I_1(p) \right)^2.$$

After the change, the error will be approximately

$$E \approx \sum_p \left( I_2(v) + \Delta c \, (D_c u)(D_u v)(D_v I_2(v)) - I_1(p) \right)^2. \tag{5}$$

Here, $D_c u$ indicates the matrix of partial derivatives
that describes how the position of the three-space
point $u$ in the Camera 2 coordinate system
varies as the position of that system (which is
determined by $c$) is varied; $D_u v$ describes how
the projection $v$ of the point $u$ onto the Camera 2
coordinate system varies as the position of
that point varies; and $D_v I_2$ is simply the intensity
gradient of the image $I_2$. (We use row vectors
for $c$ etc. so that $D_c$ etc. can be prefix operators;
$D_c$ for example is a column vector of the partial
derivatives w.r.t. $c$.) Their product, by the
chain rule, tells us how the image intensity of $I_2$
at the presumed match $v$ to the point $p$ varies as
we vary the camera parameters $c$. By knowing
this at each of the many points $p$ over which the
summation runs, we can estimate how we should
change the camera parameters $c$ to make the intensities
at each of the points $v$ match those at
their respective matching points $p$. To do this,
we differentiate (5) with respect to $\Delta c$, set equal
to zero, and solve, obtaining

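$$\Delta c = \left[ -\sum_p \left( I_2(v) - I_1(p) \right) \left( D_c I_2(v) \right)^T \right] \left[ \sum_p \left( D_c I_2(v) \right) \left( D_c I_2(v) \right)^T \right]^{-1}, \tag{6}$$

where

$$D_c I_2(v) = (D_c u)(D_u v)(D_v I_2(v)).$$

Note that $D_c I_2(v)$ is a 6 × 1 matrix (column vector),
so that the second sum of (6) is a 6 × 6 matrix
that must be inverted. This raises the question of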
under  what  conditions  this  inversion  will  be  possi¬ 
ble.  We  have  shown  elsewhere  (Lucas,  1984)  that 
this  matrix  will  be  singular  under  orthography  in 
the  case  where  all  six  parameters  are  to  be  solved 
for,  and  conversely  that  it  will  be  non-singular  un¬ 
der  perspective.  However,  if  the  reference  points 
are  positioned  in  three-space  so  that  the  perspec¬ 
tive  projection  is  well-approximated  by  the  ortho¬ 
graphic  projection,  then  the  matrix  will  be  nearly 
singular.  This  implies  that  the  points  should  not 
be too far from the camera and should be well-distributed
in distance from the camera.

Having computed $\Delta c$ according to (6), we
update our estimate $c$ to $c + \Delta c$, yielding a better
estimate.  This  procedure  results  in  an  iterative 
scheme  analogous  to  that  in  (3).  Moreover,  the 
previous  comments  concerning  smoothing  the  im¬ 
ages  to  improve  the  range  of  convergence  apply 
as  well  to  the  multi-parameter  case. 
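
The update step can be sketched with NumPy as a small least-squares solve. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the three Jacobians are assumed to be supplied by the caller for whatever camera parameterization is in use, and the matrix inversion is replaced by a linear solve, which gives the same result when the matrix is non-singular.

```python
import numpy as np

def intensity_gradient_wrt_camera(dc_u, du_v, dv_i2):
    """Chain rule (D_c u)(D_u v)(D_v I2): shapes (6,3), (3,2), (2,) give (6,)."""
    return dc_u @ du_v @ dv_i2

def camera_update(residuals, gradients):
    """Solve for the parameter change that minimizes the linearized error.

    residuals -- array of I2(v) - I1(p), one entry per reference point
    gradients -- array with one row D_c I2(v) (length 6) per reference point
    """
    a = gradients.T @ gradients   # sum of outer products, 6 x 6
    b = gradients.T @ residuals   # sum of residual times gradient, length 6
    return -np.linalg.solve(a, b)

# Synthetic check: if the intensity differences are exactly linear in the
# parameter error, one update recovers that error.
rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 6))
true_delta = np.array([0.1, -0.2, 0.05, 0.01, -0.03, 0.02])
delta = camera_update(-grads @ true_delta, grads)
print(np.allclose(delta, true_delta))  # prints: True
```

A nearly singular 6 × 6 matrix (for example, when the projection is close to orthographic) would show up here as an ill-conditioned solve, so checking `np.linalg.cond(a)` before trusting the update is a reasonable precaution.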

4.  Experimental  results 

Our  experimental  data  consisted  of  three 
views  of  the  same  scene  taken  by  a  camera 
mounted  on  the  Stanford  cart  (Moravec,  1980); 
they  are  shown  in  Figure  2.  The  camera  was 
mounted  on  a  slider,  so  we  had  accurate  knowl¬ 
edge  of  the  relative  positions  of  the  cameras.  The 
three  views  were  pictures  taken  by  the  camera  at 
the  left,  middle,  and  right  slider  positions,  with 
26  cm  separating  each  position.  The  left  picture 
was  used  as  the  reference  image,  and  a  number 
of  points  were  selected  from  this  image  as  ref¬ 
erence  points.  These  are  the  points  p  that  the 
sum  in  (5)  runs  over.  Then  the  right  picture  was 
used as the second image of a stereo pair to obtain
the distances $z(p)$ of the reference points $p$.
The method was then used to determine the position
of  the  middle  camera.  Since  the  position  of  the 
middle  camera  was  known,  we  could  assess  the 
accuracy  of  the  method.  Moreover,  by  varying 
around  the  correct  value  the  initial  estimate  of 
the middle camera’s position that we provided to
the  algorithm,  we  could  determine  the  range  of 
convergence. 


Convergence  range.  The  convergence 
range in a typical one-parameter case can be determined
from the graph in Figure 3. In this experiment
we solved for $x$ while holding the other
five parameters fixed. This graph illustrates that
the computed adjustment (vertical axis) to the $x$
estimate (horizontal axis) is zero when the value
of $x$ is at the convergence point near the correct
value of 26 cm, increases as the disparity increases
and has the proper sign, and then falls back to
zero.  The  range  of  convergence  is  the  interval 
over  which  the  computed  adjustment  is,  say,  at 
least  0.1  times  that  required  to  move  the  esti¬ 
mate  to  the  correct  answer  at  the  point  where  the 
curves  cross  the  horizontal  axis,  which  is  roughly 
the  same  as  the  region  over  which  the  adjust¬ 
ment  has  the  right  sign.  For  the  largest  smooth¬ 
ing  window,  the  range  of  convergence  is  nearly 
a  meter  in  one  direction  and  well  over  a  meter 
in  the  other  direction;  the  range  is  smaller  for 
the  smaller  smoothing  windows.  Note  that  the 
convergence  value  for  the  larger  window  is  not 
the  same  as  for  the  smaller  windows,  but  is  well 
within  their  convergence  range.  This  means  that 
a  coarse-fine  technique  would  work,  and  indeed 
could  skip  right  from  the  largest  window  to  a 
much  smaller  one,  avoiding  the  need  to  calculate  a 
multitude  of  smoothed  images.  The  convergence 
range  for  y  is  similar,  and  for  z  is  somewhat  larger 
(although  the  result  is  less  accurate).  The  con¬ 
vergence  ranges  for  pan  and  tilt  are  ±10  degrees 
or  so,  and  the  convergence  range  for  roll  is  around 
±30  degrees.  These  convergence  ranges  (except 
for roll) are of course dependent on the geometry
of  the  situation. 
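
The working definition of the range of convergence used above (the interval over which the computed adjustment is at least 0.1 times the adjustment actually required) can be written down directly. The sketch below is illustrative only: the function name is hypothetical, and the test curve is a synthetic sine-shaped response rather than the measured data of Figure 3.

```python
import numpy as np

def convergence_range(estimates, adjustments, correct, fraction=0.1):
    """Interval of initial estimates around `correct` where the computed
    adjustment has the right sign and is at least `fraction` of what is needed."""
    estimates = np.asarray(estimates, dtype=float)
    adjustments = np.asarray(adjustments, dtype=float)
    required = correct - estimates
    ok = adjustments * np.sign(required) >= fraction * np.abs(required)
    start = int(np.argmin(np.abs(required)))  # sample nearest the correct value
    lo = hi = start
    while lo > 0 and ok[lo - 1]:
        lo -= 1
    while hi < len(ok) - 1 and ok[hi + 1]:
        hi += 1
    return estimates[lo], estimates[hi]

# Synthetic response curve: adjustment = sin(correct - estimate).
est = np.linspace(-3.5, 3.5, 141)
low, high = convergence_range(est, np.sin(0.0 - est), correct=0.0)
print(round(low, 2), round(high, 2))  # prints: -2.85 2.85
```

Applied to measured curves such as those in Figure 3, the same function would report one interval per smoothing window.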

What  is  the  relationship  between  these  con¬ 
vergence  ranges  and  the  convergence  ranges  in  the 
multi-parameter  case?  This  is  shown  in  Figure  4. 
We see that if we solve for two parameters (pan
and  tilt,  top  graph),  the  range  is  smaller  than  the 
range  that  would  be  expected  on  the  basis  of  the 
one-parameter  results  for  pan  and  tilt  alone;  and 
if  we  solve  for  all  six  parameters  (bottom  graph), 
it  is  smaller  still.  Nevertheless,  the  range  is  still 
quite  adequate  for  the  continuous  feedback  mode. 
Whether it is adequate for the stop-and-go mode,
which involves a larger motion at each step, depends
on the accuracy of the arm and on the accuracy
of other navigational aids that can provide
the  initial  estimates. 

Accuracy.  To  assess  the  accuracy  under  a 
variety of conditions, we selected reference points
using  a  variety  of  methods,  including  by  hand 
and  by  computer,  resulting  in  several  sets  of  data 
points  of  various  sizes.  Then  we  doubled  the 
number  of  sets  of  reference  points  by  either  ap¬ 
plying  or  not  applying  a  pruning  process  to  the 
sets  we  had.  This  pruning  process,  which  is  de¬ 
scribed  elsewhere  (Lucas,  1984),  was  based  on  the 
method  of  differences  and  served  to  improve  the 
accuracy  of  the  stereo  matches.  It  also  eliminated 
some  points  as  being  unfit  for  use  by  the  method, 
for  example  because  they  were  in  a  region  of  small 
gradient.  The  results  are  shown  in  Figure  5.  Sev¬ 
eral  general  trends  are  observable.  First,  using 
more points produces more accurate results. Second,
the pruning process can improve the results,
as evidenced by the left endpoints of the
lines  in  the  figure  being  lower  than  the  right  end¬ 
points.  These  two  factors  are  of  course  in  conflict, 
and  the  improvement  due  to  the  pruning  process 
is  apparent  only  provided  the  number  of  points  is 
not reduced too much. Finally, the accuracy does
not  seem  to  be  affected  much  by  the  number  of 
parameters  solved  for. 

Implementation.  The  implementation  may 
be  divided  into  two  parts:  smoothing  and  cam¬ 
era  parameter  estimation.  The  smoothing  must 
be  done  over  a  relatively  large  window,  up  to 
65 × 65 in our experiments. It is the most
time-consuming step, even though we implemented it
as  uniform  smoothing  over  a  rectangular  region, 
which  by  a  well-known  algorithm  takes  a  constant 
number  of  operations  (two  additions  and  two  sub¬ 
tractions)  per  pixel,  regardless  of  the  size  of  the 
smoothing  window.  However,  it  is  fairly  well  un¬ 
derstood  how  to  build  special-purpose  hardware 
for  doing  smoothing  quickly,  essentially  in  real 
time. 
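
One standard way to obtain window-size-independent cost for uniform smoothing is a summed-area table, sketched below with NumPy. This is a swapped-in illustration and not necessarily the running-sum algorithm the authors used; the function name is hypothetical, and replicating the border pixels at the image edges is an assumption.

```python
import numpy as np

def box_smooth_2d(image, size):
    """Mean over a size x size window (size odd) using a summed-area table."""
    assert size % 2 == 1, "window size is assumed to be odd"
    half = size // 2
    padded = np.pad(np.asarray(image, dtype=float), half, mode="edge")
    # sat[i, j] holds the sum of padded[:i, :j]; the leading zero row and
    # column remove special cases at the top and left borders.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = np.asarray(image).shape
    # Each window sum is a fixed number of table lookups, whatever the size.
    total = (sat[size:size + h, size:size + w] - sat[:h, size:size + w]
             - sat[size:size + h, :w] + sat[:h, :w])
    return total / (size * size)

rng = np.random.default_rng(0)
img = rng.random((7, 9))
smoothed = box_smooth_2d(img, 3)
print(smoothed.shape)  # prints: (7, 9)
```

The per-pixel work in the last step is independent of the window size, so a 65 × 65 window costs the same as a 9 × 9 one once the table is built.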


The  parameter  estimation  step  is  more  in¬ 
teresting.  Our  implementation,  in  which  no  at¬ 
tention  was  paid  to  efficiency,  requires  approxi¬ 
mately  3  to  4  ms  per  reference  point  per  iteration 
on  a  VAX  11/780.  In  the  continuous  feedback 
mode,  only  one  iteration  per  time  step  would  be 
used  since  only  an  approximate  answer  is  needed. 
Thus  50  reference  points  (the  largest  number  used 
in  the  experiments  reported  above)  would  require 
less  than  200  ms  per  time  step.  This  figure  could 
probably  be  improved  severalfold  by  more  care¬ 
ful  coding  and  taking  account  of  the  fact  that 
some  of  the  entries  in  the  matrices  in  equation 
(6)  are  known  a  priori  to  be  zero.  This  infor¬ 
mation,  together  with  the  fact  that  the  algorithm 
has  a  regular  structure  free  of  decision  points  that 
could  easily  be  implemented  in  special-purpose 
hardware,  suggests  that  it  is  feasible  for  real-time 
control  of  a  robot. 

It  should  be  noted  that  a  considerable  stor¬ 
age  savings  is  possible  with  respect  to  the  refer¬ 
ence  image.  The  image  intensities  of  the  reference 
image  and  its  smoothed  versions  are  needed  only 
at  the  reference  points  p.  Thus,  the  entire  refer¬ 
ence  image  need  not  be  stored. 

5. Conclusions

We have demonstrated that the method of
differences provides a useful technique for optical
navigation. We have shown that the algorithm
can successfully determine all six camera parameters.
It converges to the correct position given
an estimate within something on the order of a
meter (less if more parameters are solved for), and
converges to a result accurate to a centimeter or
so (regardless of the number of parameters solved
for). Moreover, it can do so using 50 or fewer reference
points. Because of the regular structure
of  the  algorithm,  the  prospects  of  carrying  out 
the  calculations  in  real  time  with  special-purpose 
hardware  seem  good. 

Figure 1. Camera model for optical navigation. Camera 1 defines the reference
picture and coordinate system, with the origin at the “pinhole.” For any
point $p = [\, p_x \; p_y \,]$ in the reference image there is a point $q = [\, q_x \; q_y \; q_z \,]$ in
three-space that produced the image, at depth $z(p) = q_z$. This point is expressed
as $u = [\, u_x \; u_y \; u_z \,]$ in the Camera 2, or test, coordinate system. The relationship
between $q$ and $u$ is a function parameterized by the six camera parameters $c$: three
for the relative positions of the cameras and three for their relative orientation.
Finally, the three-space point appears at the point $v = [\, v_x \; v_y \,]$ in the Camera 2
image plane. The points $p$ and $v$ are said to correspond.

Figure 3. Behavior of algorithm in one-parameter case. Horizontal axis represents
initial estimate of $x$ provided, vertical axis represents computed $\Delta x$, both in cm.
Solid  curve  represents  largest  smoothing  window,  65  x  65;  others  represent  smaller 
windows. 
Figure 4. Left graph shows, for each initial value of pan and tilt, whether the
algorithm  converged  to  the  correct  value  (large  boxes),  converged  to  the  wrong 
value  (small  circles),  or  failed  to  converge  (pluses).  Solid  dot  is  correct  value,  big 
rectangle  indicates  range  predicted  by  single-parameter  results.  Right  graph  is  a 
two-dimensional  slice  of  a  similar  six-dimensional  solid,  in  which  all  six  parameters 
were  solved  for. 




Figure  5.  Graph  shows  the  absolute  error  in  x  position  on  images  smoothed  with 
9x9  window.  Each  point  represents  the  result  with  a  different  set  of  reference 
points,  distinguished  by  resulting  error  (in  cm)  on  the  vertical  axis,  and  by  number 
of  points  in  the  reference  set  on  the  horizontal  axis.  Triangles  indicate  the  case 
where three parameters were solved for, circles six. The point at the left end of
each  line  represents  a  reference  set  in  which  a  pruning  process  was  carried  out  on 
the  points  represented  by  the  right  end  of  the  line.  Large  points  represent  image 
pair  discussed  in  text,  small  points  represent  a  different  image  pair. 
Abstract

This paper presents a program to produce object-centered
three-dimensional descriptions starting from point-wise 3D
range data obtained by a light-stripe rangefinder. A careful
geometrical analysis shows that contours which appear in
light-stripe range images can be classified into eight types,
each with different characteristics in occluding vs occluded
and different camera/illuminator relationships. Starting with
detecting these contours in the iconic range image, the
descriptions are generated moving up the hierarchy of
contour, surface, object, to scene. We use conical and
cylindrical surfaces as primitives. In this process, we exploit
the fact that coherent relationships, such as symmetry,
collinearity, and being coaxial, which are present among
lower-level elements in the hierarchy allow us to hypothesize
upper-level elements. The resultant descriptions are used for
matching and recognizing objects. The analysis program has
been applied to complex scenes containing cups, pans, and toy
shovels.

1.  Introduction 

The research presented in this paper aims at producing
object-centered three-dimensional descriptions starting from
point-wise 3D range data obtained by a light-stripe
rangefinder. While most of the initial work in range data analysis
was on generating object descriptions of simple objects
(representation of snakes and dolls made of cylindrical parts
by means of generalized cylinders [1, 8] or representation of
polyhedra by planes [11, 9]), recent work seems to be more
concerned with efficient matching of objects with 3D
models [3, 4]. Our emphasis in this paper, however, is
data-driven, bottom-up, autonomous processing, generating object
descriptions from complicated scenes without referring to
specific pre-stored object models.

Starting from the iconic range data, the descriptions are
generated moving up the hierarchy of contour, surface,
object, to scene. We use conical and cylindrical surfaces as
primitives. While a top-down verification process is
important, the bottom-up process for producing plausible,
natural, object-level descriptions is at least as crucial in order
to realize general vision systems [7, 2] as the task world
becomes larger and less contrived. Our approach to the
problem is to exploit the fact that coherent relationships,
such as symmetry, collinearity, and being coaxial, that are
present among lower-level elements in the hierarchy allow
us to hypothesize upper-level elements. This is justified
because those coherent relationships do not usually occur
accidentally [6]. For example, if the same surfaces with the
same relationships appear across two scenes, they tend to be
grouped into one object; if one cylinder's axis intersects
another one's, like the relationship between the handle and
body of a pan, they tend to belong to the same object.
These coherencies must be present because they have been
inherited through the hierarchy from the scene level down
to the iconic range-data level. Our task is to trace and
exploit these coherent relationships in reverse for
autonomous generation of object descriptions.


We focused our effort on the use of occluding contours,
which can be extracted quite reliably from the light-stripe
range data. First, contours are extracted, segmented and
classified. From the coherencies among them, such as
parallelism, surfaces are hypothesized. These are
represented as conic surfaces (pipes, cones, and planes).
The surface hypotheses are confirmed or refuted by their
ability to account for observed surface area. Hypotheses of
surface groups are formed by examining coherencies among
the verified surfaces, such as axis intersections. Finally, such
surface groupings from multiple scenes are compared. If a
similar structure repeatedly occurs, it is identified as an
object. The succeeding sections of this paper follow these
steps in order.

2.  Taxonomy  of  Contours  in  Light-Stripe 
Images 

The range images for this work were produced by a
light-stripe rangefinder, which consists of an illuminator
projecting a sheet of light into the scene and a camera that
detects the amount of deflection in the light stripe on each scan
line. Triangulation produces surface-point positions in three
dimensions. A range image is shown in Figure 1, which is a
composite of the camera's views of the light stripes, using
every tenth stripe.

The parallax between illuminator and camera makes
ranging feasible, but it also causes occlusions which are
difficult to interpret. Figure 2 shows the geometry of
light-stripe imaging for a scene which includes a circular object A and the
background B. It explains how contours are generated which
bound a surface against either another surface or a region
which cannot be measured. Object A casts a deep shadow
region, the umbra, which cannot be seen from either the
illuminator or the camera. It also casts two half-shadow
regions, penumbras, which can be seen from either the
illuminator or the camera, but not both. No data can be
recorded for a surface which lies in either an umbra or a
penumbra.

Occlusions occur at the four combinations where the line
of sight, either from the illuminator or the camera, is
tangential to the surface of the foreground object A: that is,
ip1-ip2, iu1-iu2, cp1-cp2, and cu1-cu2. Here, the first point
of each pair forms the occluding contour and the second the
occluded contour; therefore there are eight types of
contours in light-stripe range imagery. This analysis
brings out a few interesting points. First, previous
researchers have dealt only with the simple occlusions
involving cp1-cp2 and ip1-ip2, by converting to three
dimensions and drawing rays from the illuminator and
camera. It can be shown that this can be done without
resorting to three-dimensional geometry, by exploiting
information in the raw deflection image. Second,
interpretation of the occlusions iu1-iu2 and cu1-cu2
provides information about the object, even though the
occluding contour is not recorded. This is especially true
and useful when we can assume cylindrical objects, because
the totally visible region (from cp1 to ip1) is fairly small, and
expanding the known part to the region from iu1 to cu1 will
greatly increase the accuracy in reconstructing the
cylindrical shape.²


Figure 3 shows the contours detected in the image of
Figure 1: a) shows all the contour points, b) shows the
occluding points, and c) and d) show the penumbral and
umbral occluded points, respectively. The occluding
contours are directly useful for shape cues, while the
occluded contours provide indirect information [10].


² Incidentally, this point suggests that the light-stripe range data be taken
with the background, as opposed to the conventional way in which the
background is blacked out by a black carpet or curtain.


Figure 1: Light-stripe image of a scene
Figure 2: Generation of contours with a light-stripe rangefinder


3.  Contour  Analysis 

3.1. Segmentation

For an extracted contour, local curvatures along it are
calculated. The contour is segmented into pieces at the
peaks of the curvature value, expecting that the segment
between peaks can be described simply. Figure 4 illustrates
an example of contour segmentation. The lower part of the
figure is a plot of curvature vs. position on the contour,
traced clockwise around the shovel handle. In this case, the
contour has been divided into three segments at the
cross-marked positions. The result of segmentation for all the
occluding contours is shown in Figure 5a).
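
The segmentation rule can be sketched in a few lines. This is only an illustration, not the authors' program: it assumes the curvature has already been sampled along an open contour, it takes the mean over that one contour (the caption of Figure 4 uses the mean over all contour points in the image), and the function names and the `factor` parameter are hypothetical. The factor of two follows the significance threshold quoted in the caption of Figure 4.

```python
import numpy as np

def segment_at_curvature_peaks(curvature, factor=2.0):
    """Return indices of significant local curvature maxima (candidate breakpoints)."""
    k = np.abs(np.asarray(curvature, dtype=float))
    # Assumed rule: a peak is significant if it exceeds factor * mean curvature.
    threshold = factor * k.mean()
    breaks = []
    for i in range(1, len(k) - 1):
        is_peak = k[i] > k[i - 1] and k[i] >= k[i + 1]
        if is_peak and k[i] > threshold:
            breaks.append(i)
    return breaks

def split_contour(points, breaks):
    """Split an ordered list of contour points into segments at the break indices."""
    bounds = [0] + list(breaks) + [len(points)]
    return [points[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
```

A closed contour would additionally need wrap-around handling at the two ends, which this sketch omits.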

3.2.  Segment  Classification 

Each contour segment is now classified into one of the
four shape categories: straight, circle, plane, and space. For
this, first a straight line is fit by least squares. If the error
residual is low, it is classed as a line. Otherwise, a plane is
fit. If the plane does not fit well, the contour is classified as
a (nonplanar) space curve. If the plane fits well, a circle is fit
to decide whether it should be declared a circle. Figure 5 b)
and c) show the results of classification of the occluding
contour segments of our scene.
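
The decision cascade above (line, then plane, then circle) might look as follows. This is a sketch under stated assumptions: the paper does not say which fitting procedures or thresholds were used, so the total-least-squares line and plane fits via SVD, the algebraic (Kasa) circle fit, the single tolerance `tol`, and all function names are illustrative choices. Each segment is assumed to contain at least three 3-D points.

```python
import numpy as np

def line_residual(pts):
    """RMS distance of 3-D points from their total-least-squares line."""
    centered = pts - pts.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return np.sqrt((s[1] ** 2 + s[2] ** 2) / len(pts))

def plane_fit(pts):
    """Return (RMS distance from the best plane, 2-D coordinates in that plane)."""
    centered = pts - pts.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return np.sqrt(s[2] ** 2 / len(pts)), centered @ vt[:2].T

def circle_residual(xy):
    """RMS radial error of an algebraic (Kasa) least-squares circle fit."""
    a = np.column_stack([2 * xy, np.ones(len(xy))])
    b = (xy ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(a, b, rcond=None)
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    d = np.hypot(xy[:, 0] - cx, xy[:, 1] - cy)
    return np.sqrt(np.mean((d - r) ** 2))

def classify_segment(pts, tol=1e-2):
    """Classify a contour segment as 'straight', 'circle', 'plane', or 'space'."""
    pts = np.asarray(pts, dtype=float)
    if line_residual(pts) < tol:
        return "straight"
    plane_err, xy = plane_fit(pts)
    if plane_err >= tol:
        return "space"          # nonplanar space curve
    return "circle" if circle_residual(xy) < tol else "plane"
```

In practice the tolerance would have to be tied to the noise level of the rangefinder; a single fixed value is used here only to keep the example short.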


4.  Surface  Analysis 

Once the contours have been analyzed, they are examined
to find what they can tell about the surfaces in the scene.
This proceeds in three steps. First, contour segments are
grouped (using coherent relationships among them) into
ensembles which suggest various primitive surface shapes.
Second, specific surface types and equations are proposed
for the contour groups together with their spatial extent.
Finally, the surface hypotheses are verified in the image
data.

4.1.  Contour  Groups 

We seek binary coherence relationships between contours
which suggest surface shapes. The relations sought depend
upon the shapes of the surfaces, and include:

•  Straight  vs  Straight 

o  Parallel 
o Antiparallel
o Perpendicular
o Collinear
o  Coplanar 
o  Opposed 

• Straight vs Circle

o  Perpendicular 
o  Stands-up 

•  Circle  vs  Circle 

o  Parallel 
o  Coaxial 
o  Concentric 
o  Cocircular 

Contours sharing appropriate combinations of these
relations are aggregated into contour groups. Each contour
group suggests a surface, and is classified as one of the
following types:

•  Cone  (including  cylinder) 

• Ribbon

•  Disc 

•  Plane 

As an example, Figure 6 shows what kind of relationships
are involved for the group type Cone. The result of the
contour grouping for our example scene is shown in Figure
7, whose caption explains the details of individual groups.
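
To make the binary relations concrete, here is a minimal sketch of tests for some of the Straight vs Straight relations listed above. The paper gives no definitions or tolerances, so the angular and distance thresholds, the geometric definitions, and the function names are assumptions; the "opposed" relation is left out because its definition is not stated.

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def straight_relations(p0, p1, q0, q1, ang_tol_deg=5.0, dist_tol=0.5):
    """Return the set of relations holding between segments p0->p1 and q0->q1."""
    p0, p1, q0, q1 = (np.asarray(x, dtype=float) for x in (p0, p1, q0, q1))
    d1, d2 = unit(p1 - p0), unit(q1 - q0)
    cos_a = float(np.clip(d1 @ d2, -1.0, 1.0))
    cos_tol = np.cos(np.radians(ang_tol_deg))
    sin_tol = np.sin(np.radians(ang_tol_deg))
    rel = set()
    if cos_a >= cos_tol:
        rel.add("parallel")
    elif cos_a <= -cos_tol:
        rel.add("antiparallel")
    elif abs(cos_a) <= sin_tol:
        rel.add("perpendicular")
    # Coplanar: the distance between the two supporting lines is small.
    offset = q0 - p0
    normal = np.cross(d1, d2)
    n = np.linalg.norm(normal)
    if n < 1e-9 or abs(offset @ normal) / n <= dist_tol:
        rel.add("coplanar")
    # Collinear: (anti)parallel and q0 lies close to the line through p0, p1.
    if rel & {"parallel", "antiparallel"}:
        if np.linalg.norm(np.cross(offset, d1)) <= dist_tol:
            rel.add("collinear")
    return rel
```

For instance, two opposite edges of a unit square come out as parallel and coplanar but not collinear under these assumed tolerances.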

4.2.  Surface  Proposal 

Each surface group proposed is converted to a surface
equation together with the limits of its spatial extent. Figure
8 shows the surface proposals generated from the
corresponding contour groups of Figure 7.
Figure 3: Contours in the image of Figure 1

4.3.  Surface  Verification 

The surface hypotheses are checked by going back to the
range image and determining if each hypothesis can account
for enough surface area in the scene. This involves a test of
position and surface normal in 3-space for each pixel. The
pixels which pass this test are summed according to the area
which they individually represent in the scene. Based on this
test, the proposed surfaces are either accepted or rejected as
missing. For example, proposed surface 2 (top of the pan
opening) is rejected, because no real surface exists at the
proposed position. The surfaces which are accepted are
shown together in Figure 9.
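
The verification test can be sketched as an area-weighted vote. The following is a hedged illustration only: the tolerances, the acceptance fraction `min_fraction`, and the representation of a hypothesis as a pair of callables (`dist_fn` giving signed distance to the surface, `normal_fn` giving the expected unit normal) are assumptions, not details taken from the paper.

```python
import numpy as np

def supported_area(points, normals, areas, dist_fn, normal_fn,
                   dist_tol=0.5, ang_tol_deg=15.0):
    """Sum the scene area of pixels consistent with a surface hypothesis.

    points  : (n, 3) measured 3-D positions, one per range pixel
    normals : (n, 3) measured unit surface normals
    areas   : (n,)   scene area represented by each pixel
    """
    near = np.abs(dist_fn(points)) <= dist_tol                  # position test
    cos = np.einsum("ij,ij->i", normals, normal_fn(points))     # normal test
    aligned = cos >= np.cos(np.radians(ang_tol_deg))
    return float(areas[near & aligned].sum())

def verify(points, normals, areas, dist_fn, normal_fn,
           expected_area, min_fraction=0.5):
    """Accept the hypothesis if it accounts for enough of its expected area."""
    got = supported_area(points, normals, areas, dist_fn, normal_fn)
    return got >= min_fraction * expected_area
```

For a plane z = 0, for example, `dist_fn` could be `lambda p: p[:, 2]` and `normal_fn` could return the vector (0, 0, 1) for every point.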


Figure 4: Segmenting a contour. The upper part is the contour of the
shovel handle and the graph is a plot of curvature vs. position on the
contour traced clockwise. The curve is segmented at the tip of the left
peak and on the shoulders of the peak in the middle. The line at 3.2
is the mean curvature for all contour points in the image. The
threshold for peak significance is twice the mean (upper line). A
more fortuitous choice would have segmented at the right peak,
where the handle tapers into the shovel blade at lower right.


Figure 5: Results of contour segmentation and classification: a) all
contours with marks at the segment end points; b) circular contour
segments; c) straight contour segments. Short segments are
suppressed.
Figure 6: The group type Cone and the relationships involved for it.
Not all of the segments and relationships are necessary to suggest the
group.

5.  Object  Analysis 

Surfaces are grouped into objects in a manner similar to
contour grouping, but in this case based on coherent
relationships among surfaces. In addition to the obvious
(strong) relationship of shared contours (i.e., connectivity),
we have considered the following relationships:
These relationships are examined between pairs of surfaces
and their evidence is weighted by their strength and
importance. The results are summarized as a graph, such as
shown in Figure 10, in which the nodes are the surfaces and
the arcs carry the weights of relationships. The surface
groups with stronger relationships are shown in Figure 11.
For example, the cup body and handle are strongly grouped
together because:


• The axes are nearly parallel.

• But the handle axis does tip in a little, and its
extension intersects the extension of the body
axis.

• The surfaces are close together.

• The surfaces oppose each other for the whole
length of the handle.

• In the image, the surfaces have a path to bleed
together.
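
One simple way to turn such a weighted graph into surface groups is to keep only the arcs above a threshold and take connected components. This is a sketch of that idea, not the grouping procedure of the paper, which is not spelled out; the threshold, the surface names, and the weights in the usage note are all hypothetical.

```python
from collections import defaultdict

def group_surfaces(weights, threshold):
    """Connected components of the graph restricted to arcs with weight >= threshold.

    weights : dict mapping a pair of surface names (a, b) to a relation weight
    """
    adj = defaultdict(set)
    nodes = set()
    for (a, b), w in weights.items():
        nodes.update((a, b))
        if w >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    groups, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first traversal of strong arcs
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups
```

With made-up weights such as `{("cup_body", "cup_handle"): 0.9, ("pan_body", "pan_handle"): 0.8, ("pan_body", "shovel_handle"): 0.6, ("cup_body", "pan_body"): 0.1}` and a threshold of 0.5, the cup parts form one group and the pan body, pan handle, and shovel handle form another, so a strong accidental relation is enough to merge surfaces of different objects.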

However, grouping at this level should not be considered
final, because surfaces not belonging to the same object can
exhibit strong accidental alignments. In the example scene,
the shovel handle has a fairly strong relationship with the
pan body, since their axes intersect and the shovel is inside
the pan. These are both accidental alignments. Without
other knowledge, it is not possible to tell whether they
belong to the same object. One way to resolve this problem
is to combine information from different scenes, which will
be discussed next.

6.  Multiple  Scene  Analysis 

Figure 12 shows a different scene and its processing
results. Note that the round-handled cup is also included in
this scene, with a view angle which is opposite to that in the
previous scene. The extracted surface components and their
relationships, however, are the same as in the previous
scene.

In this way, if a group of surfaces can be identified in
several different scenes, holding the same relative positions
and orientations, one can assert that they are part of a
common object. Relating objects from different scenes is a
matching problem, illustrated in Figure 13. Suppose object
1 comprising surfaces a and b from scene 1 matches object 2
comprising c and d from scene 2. Then two conditions must
be met:

• A surface from object 1 matches a surface from
object 2. This is called the surface match.

• If the surfaces a and b are to match surfaces c
and d respectively, then the placement and
orientation of b with respect to a must match the
placement and orientation of d with respect to c.
This is called the transform match.

The transform match gets its name from the transform
which maps the local coordinate system of b into the
coordinate system of a. This must match the transform
mapping the coordinate system of d into that of c.
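
In the simplest case the transform match can be sketched with homogeneous matrices. The code below is a naive illustration under strong assumptions: each surface is assumed to carry a unique local frame given as a 4x4 matrix mapping local to scene coordinates, the parts are assumed rigid, and the tolerances and function names are hypothetical. It deliberately ignores symmetric surfaces and articulated objects, for which a frame-by-frame comparison like this is not adequate.

```python
import numpy as np

def rigid(rz_deg, t):
    """Hypothetical helper: rotation about z by rz_deg followed by translation t."""
    a = np.radians(rz_deg)
    m = np.eye(4)
    m[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    m[:3, 3] = t
    return m

def relative_transform(frame_a, frame_b):
    """4x4 transform mapping coordinates in frame b into frame a."""
    return np.linalg.inv(frame_a) @ frame_b

def transforms_match(t1, t2, ang_tol_deg=5.0, dist_tol=0.5):
    """True if two relative transforms agree within angular and positional tolerance."""
    delta = np.linalg.inv(t1) @ t2
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.degrees(np.arccos(cos_angle))      # residual rotation angle
    shift = np.linalg.norm(delta[:3, 3])          # residual translation
    return bool(angle <= ang_tol_deg and shift <= dist_tol)
```

If frames a and b in scene 1 are moved by the same rigid motion to give c and d in scene 2, the two relative transforms agree and the test succeeds; displacing d alone makes it fail.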
Figure 7: The result of the contour grouping for Figure 5 a). There
are gaps in the numbering because one group may be subsumed in a
later hypothesis, as the binary relations involving different contours
are examined. Group 2 is formed by a set of cocircular contours, and
suggests a (nonexistent) disc. Group 4 suggests the frustum of a
cone. It was formed because the bottom circular contour was coaxial
with the top ones, while the straight sides extended between top and
bottom. Groups 8 and 12 were aggregated in the same way, although
group 12 lacks a bottom. Groups 5, 9, and 13 are discs suggested by
lone circular contours. Groups 6 and 11 are ribbons. The ribbon
classification avoids too-early commitment to shape, since there are
no cross-contours to indicate whether the surface is curved. Group 6
lacks the cross-contour because the contour at the end was discarded
as unreliable because it was too short.


To solve the transform match in a general manner is not a
straightforward calculation of coordinate transformation
matrices, for three reasons. First, when the component
surfaces include some symmetry, such as in cylinders and
cones, the placement of the coordinate frame is not unique.
Therefore, the computed transforms may appear different
even for the same geometrical situation. Secondly, the
objects may have parts connected by linear or rotational
articulations, such as scissors. We need a method of
representing, calculating, and comparing transforms which
accommodates objects with articulations. Finally, due to the
measurement errors, the calculated transforms won't be
Figure 8: Surfaces proposed by the contour groups in Figure 7.
Surfaces 4 and 8 are internally represented as rings cut from quadric
surfaces, but are symbolically classed as (frustums of) cones. Surfaces
6 and 11, derived from ribbon contour groups, are seen to be
cylinders.


exactly the same even when they should be. Thus a method
is required which can tell whether the transform is
approximately the same. This is especially important for the
cases of objects including articulations, because
element-by-element comparison does not work. Smith [12] and Tomita
and Kanade [13] discuss these three problems in more detail,
and present partial solutions. A technique of merging
descriptions from image sequences has also been developed
in [5] for the domain of aerial photo interpretation.



7.  Summary 

This paper has described a method for 3D range-data
analysis, which uses coherencies among contours, surfaces,
and scenes to generate object descriptions. Specifically, the
following points have been discussed:

• Taxonomy of contours in light-stripe images.
This helps us to understand what causes each
contour and whether it occludes or is occluded. The
detection can be done with the initial deflection
image, prior to conversion to three dimensions.

• Explicit use of the hierarchy of contour, surface,
object and scene to analyze range-data imagery.

•  Development  of  a  method  for  utilizing  coherent 
relations  among  lower-level  elements  as  shape 
cues  to  aid  extraction  of  higher-level  elements. 

•  Data-driven  autonomous  generation  of  object 
descriptions  without  specific  prestored  models. 

We  plan  to  further  develop  and  implement  a  systematic 
method  of  combining  images  of  different  scenes  to  obtain 
consistent  descriptions  of  objects. 
1. ABSTRACT

We describe an approach for describing 3-D surfaces
by using the surface curvature. Surface curvature
completely determines a surface. Further, we suggest that
simple properties of curvature, such as points and lines
where principal curvature is a maximum, correspond to
important physical properties of the surface. Several
examples are given; however, the process of interpretation of
curvature properties has not been fully completed.

2.  INTRODUCTION 

We are interested in the description of 3-D surfaces
and objects, assuming that range data (i.e., the 3-D
positions) of the points on the visible surface are available,
say by the use of a laser range finder. We also assume
that this data is "dense", in the sense of being sampled on
a certain grid and not just at discontinuities (as may be
the case for uninterpolated stereo edge data); interpretation
from sparse data is discussed in another paper from
our group in these proceedings [1].

To generate useful descriptions, we need a useful
representation. In general, such a description should be
suitable for the task of object recognition and position
identification. It should be rich, so that similar objects can
be identified; stable, so that local changes do not radically
alter the descriptions; and have local support, so that
partially visible objects can be identified. It should also
enable us to recreate, from its features, a shape reasonably
close to the original one.

Generalized cones have come to be recognized as
an important class of representations that satisfy the
above requirements, particularly for complex objects, which
are described as assemblies of smaller objects [2].
However, generalized cones are "volume" descriptions and
may not be suited for objects that are essentially surfaces,
such as a metal sheet, or for relatively smooth, "featureless"
surfaces such as a turbine blade. In this paper, our
interest is in inspection of such high-precision surfaces (our
representation may also turn out to be an important step
in generating generalized cone descriptions).

Our approach to describing a surface is to consider
the curvature of the surface described by the range image,
and more specifically to identify and extract significant
changes of this curvature. It is therefore related to the
"Curvature Primal Sketch" of Asada and Brady [3] describing
closed planar curves, but we are now describing
surfaces in 3-D. Our approach allows for description of local
properties, in contrast to global methods such as
"Extended Gaussian Images" [4].

3.  CURVATURE  REPRESENTATION 

3.1.  Mathematical  Background 

A curve in $E^3$ (3-dimensional Euclidean space) is
uniquely determined by 2 local quantities called curvature
and torsion. Similarly, a surface in $E^3$ is determined by 2
local invariant quantities called the first and second
fundamental forms [5].

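Let $\mathbf{z} = \mathbf{f}(x,y)$ be a coordinate patch on a surface of
class $\geq 1$ ($\mathbf{f}$ is in $C^1$ and the rank of the Jacobian matrix is
2). By convention, let $d\mathbf{z} = \mathbf{f}_x\,dx + \mathbf{f}_y\,dy$, with
$\mathbf{f}_x = \partial\mathbf{z}/\partial x$ and $\mathbf{f}_y = \partial\mathbf{z}/\partial y$.

$d\mathbf{z}$ has the property that

$$\mathbf{f}(x+dx,\, y+dy) = \mathbf{z} + d\mathbf{z} + o\big((dx^2+dy^2)^{1/2}\big) \qquad (1)$$

So $d\mathbf{z}$ is a first-order approximation of $\mathbf{f}(x+dx,\, y+dy) - \mathbf{z}$.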


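Consider the quantity

$$I = d\mathbf{z}\cdot d\mathbf{z} = (\mathbf{f}_x\,dx + \mathbf{f}_y\,dy)\cdot(\mathbf{f}_x\,dx + \mathbf{f}_y\,dy)$$

$$= (\mathbf{f}_x\cdot\mathbf{f}_x)\,dx^2 + 2(\mathbf{f}_x\cdot\mathbf{f}_y)\,dx\,dy + (\mathbf{f}_y\cdot\mathbf{f}_y)\,dy^2$$

$$= E\,dx^2 + 2F\,dx\,dy + G\,dy^2 \qquad (2)$$

where

$$E = \mathbf{f}_x\cdot\mathbf{f}_x, \quad F = \mathbf{f}_x\cdot\mathbf{f}_y, \quad G = \mathbf{f}_y\cdot\mathbf{f}_y \qquad (3)$$

I is known as the first fundamental form.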

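Let us now suppose that $\mathbf{z} = \mathbf{f}(x,y)$ is a patch of class
$\geq 2$. We can define a unit normal at each point,
$\mathbf{N} = \mathbf{f}_x\times\mathbf{f}_y / |\mathbf{f}_x\times\mathbf{f}_y|$; then $d\mathbf{N} = \mathbf{N}_x\,dx + \mathbf{N}_y\,dy$.
Consider the quantity

$$II = -d\mathbf{z}\cdot d\mathbf{N} = -(\mathbf{f}_x\,dx + \mathbf{f}_y\,dy)\cdot(\mathbf{N}_x\,dx + \mathbf{N}_y\,dy)$$

$$= -\mathbf{f}_x\cdot\mathbf{N}_x\,dx^2 - (\mathbf{f}_x\cdot\mathbf{N}_y + \mathbf{f}_y\cdot\mathbf{N}_x)\,dx\,dy - \mathbf{f}_y\cdot\mathbf{N}_y\,dy^2$$

$$= L\,dx^2 + 2M\,dx\,dy + N\,dy^2 \qquad (4)$$

where

$$L = -\mathbf{f}_x\cdot\mathbf{N}_x = \mathbf{f}_{xx}\cdot\mathbf{N}, \quad M = -\tfrac{1}{2}(\mathbf{f}_x\cdot\mathbf{N}_y + \mathbf{f}_y\cdot\mathbf{N}_x) = \mathbf{f}_{xy}\cdot\mathbf{N}, \quad N = -\mathbf{f}_y\cdot\mathbf{N}_y = \mathbf{f}_{yy}\cdot\mathbf{N} \qquad (5)$$

(the scalar coefficient $N$ is distinct from the unit normal $\mathbf{N}$).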

II is known as the second fundamental form, and $\delta = \tfrac{1}{2}II$ is
called the osculating paraboloid at each point P; its
nature qualitatively describes the nature of the surface
around P, based upon the discriminant $LN - M^2$.

- Elliptic case: $LN - M^2 > 0$. Here, $\delta$ as a function of
(dx, dy) is an elliptic paraboloid, as shown in figure 1(a).
The surface lies on one side of the tangent plane.

- Hyperbolic case: $LN - M^2 < 0$. $\delta$ as a function of
(dx, dy) is a hyperbolic paraboloid, as shown in figure
1(b). There exist 2 lines in the tangent plane
through P which divide the tangent plane into 4
sections in which $\delta$ is alternately positive and negative.

- Parabolic case: $LN - M^2 = 0$ and $L^2 + N^2 + M^2 \neq 0$. $\delta$ as a
function of (dx, dy) is a parabolic cylinder, as shown
in figure 1(c). There is a single line in the tangent
plane through P along which $\delta = 0$; otherwise $\delta$ keeps
a constant sign.

- Planar case: $L = M = N = 0$. Here $\delta = 0$ for all (dx, dy).
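
The four cases can be expressed as a small decision function. This is a minimal sketch; the tolerance `eps`, which a numerical implementation needs in place of exact comparison with zero, and the function name are assumptions.

```python
def classify_point(L, M, N, eps=1e-9):
    """Classify a surface point from its second-fundamental-form coefficients."""
    disc = L * N - M * M
    if disc > eps:
        return "elliptic"
    if disc < -eps:
        return "hyperbolic"
    # Discriminant is (numerically) zero: parabolic unless all coefficients vanish.
    if L * L + M * M + N * N > eps:
        return "parabolic"
    return "planar"
```

For example, the coefficients (1, 0, 1) give an elliptic point, (1, 0, -1) a hyperbolic one, (1, 0, 0) a parabolic one, and (0, 0, 0) a planar one.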

At each point, surface shape can be completely
described by the six coefficients E, F, G, L, M, N mentioned
above. These coefficients, however, are not independent,
and it seems better to use more meaningful descriptors,
such as the principal curvatures.

Let us draw a curve C of class $C^2$ on our surface,
passing through point P. The normal curvature $\kappa_n$ of C at P
is the projection of the curvature vector $\mathbf{k}$ of C at P on $\mathbf{N}$:
$\kappa_n = \mathbf{k}\cdot\mathbf{N}$.

It is easy to prove that $\kappa_n = II/I$.

The two directions for which the values of $\kappa_n$ take
on maximum and minimum values are called the principal
directions, and the corresponding normal curvatures, $\kappa_1$
and $\kappa_2$, are called the principal curvatures.

The principal directions can be shown to be orthogonal,
and a number $\kappa$ is a principal curvature if
and only if $\kappa$ is a solution of the equation

$$(EG - F^2)\,\kappa^2 - (EN + GL - 2FM)\,\kappa + (LN - M^2) = 0 \qquad (6)$$
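
Equation (6) is an ordinary quadratic, so the principal curvatures follow from the quadratic formula. The sketch below assumes the six coefficients are already known at a point; the function name and the clamping of the discriminant (to guard against round-off, since the roots are real for a regular surface) are illustrative choices.

```python
import numpy as np

def principal_curvatures(E, F, G, L, M, N):
    """Solve (EG-F^2) k^2 - (EN+GL-2FM) k + (LN-M^2) = 0 for k1 >= k2."""
    a = E * G - F * F                     # positive for a regular patch
    b = -(E * N + G * L - 2 * F * M)
    c = L * N - M * M
    disc = max(b * b - 4 * a * c, 0.0)    # clamp tiny negative round-off
    root = np.sqrt(disc)
    k1 = (-b + root) / (2 * a)
    k2 = (-b - root) / (2 * a)
    return k1, k2
```

With E = G = 1, F = M = 0, L = 2 and N = 0.5, the two roots are 2 and 0.5, as expected for a patch whose coordinate axes are already the principal directions.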


3.2.  Choice  of  Representation 

For shape description, it is not sufficient that the
representation specify the original shape completely. We
propose to use certain points and lines that have
distinguished and invariant properties, and the process of
description requires making these points and lines explicit.
They can be useful for matching (registration) or as the
key steps in computing higher-level descriptions, such as
generalized cones. We suggest that the distinguished
points are essentially the points of extremal curvature, or
those of curvature changes, and the lines connecting such
contiguous points. We investigate the properties of such
points and lines for specific configurations in surfaces
below.

For the curvature at a point, we will use the principal
curvatures ($\kappa_1$ and $\kappa_2$) and the Gaussian curvature ($\kappa_1\kappa_2$).
The principal curvatures have a magnitude as well as
direction, with the two principal directions being
orthogonal.


4.  COMPUTING  CURVATURE 

First, a few comments on computing curvature.
Since we have discrete data, we must compute differences
rather than derivatives. We compute the first differences
in the x and y directions by convolving the range images
with masks as shown in Fig. 2(a). The second differences
in x and y, and the cross-derivative $\partial^2 f/\partial x\,\partial y$, are computed by
convolving with masks shown in Fig. 2(b). These masks
are obtained by differentiating Lagrange polynomials.
The principal curvatures and their orientations are obtained
by solving equations (6) and (7).
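
As a rough illustration of this pipeline, the following sketch computes curvature maps from a range image. It rests on several assumptions: the surface is treated as the graph (x, y, z(x, y)) sampled on a square grid of spacing `h`, central differences from `numpy.gradient` stand in for the masks of Fig. 2 (which are not reproduced here), and the function name is hypothetical. The sign of the principal curvatures depends on the choice of the upward-pointing normal.

```python
import numpy as np

def curvature_maps(z, h=1.0):
    """Gaussian curvature and principal curvatures of a range image z[y, x]."""
    fy, fx = np.gradient(z, h)            # axis 0 is y, axis 1 is x
    fxy, fxx = np.gradient(fx, h)
    fyy, _ = np.gradient(fy, h)
    w = np.sqrt(1.0 + fx ** 2 + fy ** 2)  # length of the unnormalized normal
    # First and second fundamental form coefficients of the graph surface.
    E, F, G = 1.0 + fx ** 2, fx * fy, 1.0 + fy ** 2
    L, M, N = fxx / w, fxy / w, fyy / w
    denom = E * G - F ** 2
    K = (L * N - M ** 2) / denom                        # product of roots of (6)
    H = (E * N + G * L - 2 * F * M) / (2 * denom)       # half the sum of roots
    root = np.sqrt(np.maximum(H ** 2 - K, 0.0))
    return K, H + root, H - root
```

On a synthetic spherical cap of radius 2 the Gaussian curvature at the center comes out close to 1/4, which is a convenient sanity check; on real range data the differences would normally be taken after smoothing.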

Computation of curvature is likely to be highly noise
sensitive. To decrease the effects of noise, we can use a
large support for computing differences, at a possible cost
in accuracy of localization. We convolve the image with
rotationally symmetric Gaussian masks of different
standard deviation, σ. The different sizes of the filter give us
curvature at different scales, and are in fact helpful in
interpreting the results (as in [3] and [6] for other applications).
The width of the mask is chosen such that the volume
under the truncated Gaussian is very close to 1 (6σ is a
good approximation).

For example, if σ = 1, the 5x5 filter has a volume of
0.982 and coefficients in the upper left corner shown in
figure 3.
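
The quoted volume can be checked numerically. The sketch below assumes the mask is obtained by sampling the continuous unit-volume Gaussian at integer pixel offsets without renormalizing; the function name is hypothetical. Under that assumption the 25 coefficients of the 5x5 mask for σ = 1 sum to about 0.982, in agreement with the figure given in the text.

```python
import numpy as np

def gaussian_mask(sigma, size):
    """Sample a unit-volume 2-D Gaussian on a size x size grid of integer offsets."""
    r = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-r ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return np.outer(g, g)     # separable: product of two 1-D Gaussians

mask = gaussian_mask(1.0, 5)
volume = mask.sum()           # volume under the truncated, sampled Gaussian
```

Widening the mask to 7x7 at the same σ brings the sum closer to 1, which illustrates why the mask width is tied to σ.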


A direction (du, dv) is a principal direction at P if and only
if du and dv satisfy

$$(EM - LF)\,du^2 + (EN - LG)\,du\,dv + (FN - MG)\,dv^2 = 0 \qquad (7)$$

We also need to define the Gaussian curvature K at P:

$$K = \kappa_1\kappa_2 = \frac{LN - M^2}{EG - F^2} \qquad (8)$$

We can also prove that $EG - F^2 > 0$; therefore the sign
of K agrees with the sign of $LN - M^2$. Thus, a point on a
surface is elliptic if and only if K > 0, hyperbolic if and only
if K < 0, and parabolic or planar if and only if K = 0.


5. INTERPRETING CURVATURE


5.1.  Introduction 

We are interested in three major properties of
surfaces:

- Range discontinuities (or "jump" boundaries): these
are typically the occluding contours of a surface.

- Surface discontinuities: these correspond to folds or
cuts in a surface, such as the edges of a polyhedral
object.

- Points of curvature maxima: these are physical
invariants of a surface.
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These  are  the  properties  of  the  surfaces  that  we 

wish  to  identify,  we  now  need  to  verify  that  they  translate 
easily  into  curvature  properties,  which  is  what  we 

measure. 

The first and simplest attribute of curvature
properties is the sign of the Gaussian curvature. There
have been empirical observations that the shape of the
regions of constant Gaussian curvature sign roughly
reflects the overall shape of an object [7]. We will
therefore detect zero-crossings of the Gaussian curvature,
together with their angle. This feature allows us to
reconstruct exactly the regions of constant Gaussian
curvature sign.

Gaussian curvature, however, captures only a small
part of the total information, and is totally inappropriate to
describe the large subclass of cylindrical objects. Indeed,
the Gaussian curvature is 0 for such objects, regardless of
the cross-section shape. Such objects are then described
by the variations of the maximum principal curvature, and
more succinctly by the location of the zero-crossings and
maxima of this function.
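
The sign classification of equation (8) can be tried on synthetic range images. The sketch below is only an illustration: it assumes the surface is the graph z(x, y) of the range image, uses NumPy central differences instead of the masks of Fig. 2, and the names `gaussian_curvature` and `classify` are hypothetical.

```python
import numpy as np

def gaussian_curvature(z, h=1.0):
    """Gaussian curvature K of a range image z[y, x] with pixel spacing h.

    Assumption: the surface is the graph (x, y, z(x, y)), for which
    equation (8) reduces to
        K = (zxx*zyy - zxy^2) / (1 + zx^2 + zy^2)^2.
    """
    zy, zx = np.gradient(z, h)       # axis 0 is y, axis 1 is x
    zxy, zxx = np.gradient(zx, h)
    zyy, _ = np.gradient(zy, h)
    return (zxx * zyy - zxy**2) / (1.0 + zx**2 + zy**2) ** 2

def classify(k, tol=1e-6):
    """Label a point from the sign of K (tol is an assumed noise floor)."""
    if k > tol:
        return "elliptic"
    if k < -tol:
        return "hyperbolic"
    return "parabolic or planar"

coords = 0.1 * np.arange(-10, 11)
x, y = np.meshgrid(coords, coords)
surfaces = {
    "bowl": 0.5 * (x**2 + y**2),      # K > 0 at the center
    "saddle": 0.5 * (x**2 - y**2),    # K < 0 at the center
    "cylinder": 0.5 * x**2,           # K = 0 whatever the cross-section
}
for name, z in surfaces.items():
    k_center = gaussian_curvature(z, h=0.1)[10, 10]
    print(name, classify(k_center))
```

The cylinder case shows why Gaussian curvature alone cannot describe cylindrical objects: K vanishes everywhere, so only the maximum principal curvature carries the cross-section shape.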

These are the features we will be using for shape
description: zero-crossings of the Gaussian curvature, and
zero-crossings and maxima of the largest principal
curvature.

In the introduction, we suggested that objects are
naturally segmented at the points of maximum curvature,
with no mention of sign changes. The explanation is
simple: the curvature we mentioned previously is the one
we perceive by rolling a finger on the object, for example,
and not the one we compute using digital filters. In the
case of a jump boundary, for instance, we will see that,
due to the digital computations, the discontinuity is
located by the zero-crossing position between a large
positive and a large negative response.

Let us now generate some instances of interesting
surface properties and observe the corresponding
curvature transitions.

5.2.  Some  Special  Cases 

5.2.1.  Step  edge 

An example of a step edge is shown in figure 4. We
can consider it as a 2D problem, the minimum curvature
being 0 almost everywhere (planar patch). In the
continuous case, the first derivative should be 0 everywhere,
except at the discontinuity where we have a delta function.
The second derivative is also 0 everywhere, except at the
discontinuity where we have 2 delta functions of opposite
sign. In the discrete case, the maximum curvature is
characterized by 2 maxima of opposite sign and a
zero-crossing at the discontinuity location. As σ increases, the
value of these maxima decreases and they move away
from the (fixed) zero-crossing location, as shown in figure
5.
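
This scale behavior is easy to reproduce on a synthetic 1D profile. The sketch below is a simplified illustration, not the processing used for figure 5: it assumes a unit jump, Gaussian smoothing of the profile, and the second difference as a stand-in for the maximum curvature response. The function name is hypothetical.

```python
import numpy as np

def smoothed_step_response(sigma, n=201, height=1.0):
    """Second difference of a 1D step profile smoothed with a Gaussian.

    Assumption: the second difference approximates the curvature
    response across the jump.
    """
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()

    profile = np.zeros(n)
    profile[n // 2:] = height                      # jump between n//2 - 1 and n//2
    padded = np.pad(profile, radius, mode="edge")  # avoid border artifacts
    smooth = np.convolve(padded, kernel, mode="valid")

    second = np.zeros(n)
    second[1:-1] = smooth[2:] - 2.0 * smooth[1:-1] + smooth[:-2]
    return second

results = {}
for sigma in (1.0, 2.0, 4.0):
    d2 = smoothed_step_response(sigma)
    i_max, i_min = int(np.argmax(d2)), int(np.argmin(d2))
    results[sigma] = (i_max, i_min, float(d2[i_max]))
    print(sigma, i_max, i_min, d2[i_max])
```

For each σ the response changes sign between samples 99 and 100, where the jump sits, while the two extrema become weaker and move further apart as σ grows.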

5.2.2.  Surface  discontinuities 

This corresponds to a fold in a surface. We can
identify different types of such an instance, depending on
the sign of the curvature on each side of the fold, as
illustrated in figure 6. The common characteristics are a
maximum of the curvature near the location of the fold,
decreasing with larger values of σ. Zero-crossings may
appear, depending on the curvature sign, as shown in
figure 7, representing the variation of κ₁ for the fold
shown in figure 6(c). If the curvature is 0 on either side,
we have a polyhedral edge, and the corresponding
behavior of κ₁ is shown in figure 8 for different values of σ.

5.2.3.  Maximum  of  curvature 

This  is  typically  illustrated  by  an  elliptical  cylinder, 
where  the  maximum  curvature  keeps  a  constant  sign  and 
goes  through  a  very  smooth  maximum. 

5.2.4.  Others 

We could define more types of transitions, as in [3],
some of them ambiguous at certain scales; it is, however,
more interesting to concentrate on the problems arising
when the Gaussian curvature is non-zero.

The case of positive Gaussian curvature is rather
straightforward, and only maxima of κ₁ can appear in
these zones. False maxima may appear in the
neighborhood of umbilical points (for which κ₁ = κ₂), when κ₁ and
κ₂ exchange their role, undergoing a sudden 90° change
in orientation.

In negative Gaussian areas, the situation is more
complex: around points where the magnitude of κ₁ and
κ₂ is the same, we can have reversals of (κ₁, κ₂), that is,
the larger curvature becomes κ₂, undergoing a 90°
change and creating a zero-crossing in the curves
representing the maximum and the minimum principal
curvatures, even though the surface itself is smoothly
changing. This effect is shown in the 'vase' example.
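
The reversal can be reproduced with two smooth curvature profiles of opposite sign. The numbers below are invented for illustration and are not measurements from the vase image; the sketch also assumes that κ₁ denotes the principal curvature of larger magnitude.

```python
import numpy as np

# Hypothetical profile along the neck of a surface of revolution:
# one principal curvature stays positive, the other stays negative.
t = np.linspace(0.0, 1.0, 10)
k_cross = 1.0 - 0.8 * t          # cross-section curvature, always > 0
k_profile = -(0.2 + 0.8 * t)     # profile curvature, always < 0

# Label the curvature of larger magnitude as k1, the other as k2.
swap = np.abs(k_profile) > np.abs(k_cross)
k1 = np.where(swap, k_profile, k_cross)
k2 = np.where(swap, k_cross, k_profile)

# Positions where k1 changes sign between consecutive samples.
sign_changes = np.nonzero(np.sign(k1[:-1]) != np.sign(k1[1:]))[0]
print(sign_changes)
```

Neither input profile crosses zero, yet κ₁ changes sign at the sample where the two magnitudes cross, which is the spurious zero-crossing described above.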

6. RESULTS

Our implementation so far consists of the computation
of the principal curvatures, the extraction of
zero-crossings from the Gaussian curvature and the maximum
curvature images, and the localization of the maxima in
the maximum curvature image. We have chosen two
representative examples to illustrate the strong link between
curvature behavior and the features we wish to detect:

The first one is the range image of a straight
cylinder with a hexagonal base. It is a good example of a
zero Gaussian curvature surface, showing polyhedral edges
and jump boundaries.

The second one is the range image of a vase, which
can be thought of as a Straight Homogeneous Circular
Generalized Cylinder [8], meaning that it has a circular
cross section and a straight axis. It presents zones of
positive and negative Gaussian curvature, jump boundaries,
surface discontinuities and zero-crossings due to reversals
of (κ₁, κ₂).

Figures 9 and 10 present the processing performed
on each of these range images, and are organized as
follows:

(a) is a graphic presentation of the object against the
background

(b) is a 'needle map' representation of the surface
orientation, obtained from the first order differences

(c) is a representation of κ₁, the maximum principal
curvature. The length of the needle is proportional
to the magnitude of the curvature. The curvature
displayed here is the one obtained without any
smoothing of the data.

(d-g) show the zero-crossings of κ₁ after smoothing the
original data with a σ of 0, 0.5, 1 and 2 respectively.

(h-k) show the maxima of κ₁ after smoothing the original
data with a σ of 0, 0.5, 1 and 2 respectively.

Finally, figures 10(l-o) show the zero-crossings of
the Gaussian curvature for the 'vase' image. For the
hexagon image, the Gaussian curvature is 0 everywhere.

We can make the following comments on these
figures:

- The horizontal lines in figures 9(d) through 9(g)
detect zero-crossings of κ₁, which varies in these
areas between −10⁻⁷ and +10⁻⁷, so that these lines
should be ignored.

- In this first example, we clearly see that the maxima
of κ₁ identify the polyhedral edges accurately for all
values of σ. Also, the jump boundary corresponds
to a zero-crossing of κ₁ flanked by two maxima of
opposite sign, receding from the zero-crossing as σ
increases, as predicted.

- On the vase example, we see that the computation
of curvature is very sensitive to local distortions,
such as quantization effects visible in 10(a) and
reflected in the processed views 10(d), 10(h) and
10(l).

- The Gaussian curvature is a good qualitative
description of the object, as shown in 10(l-o).

- It is the zero-crossings of κ₁ (and not of the
Gaussian curvature) which accurately correspond to the
non-jump boundary of the object, and the accuracy
decreases as σ increases.

- In the bottleneck area of the vase, we detect
zero-crossings of κ₁, even though no significant changes
of the surface occur. They are detected as a
consequence of the 'swap' between κ₁ and κ₂, since
the two principal curvatures have opposite sign, and
are accompanied by a 90° rotation of κ₁ and κ₂.
This phenomenon is well illustrated by figure 11,
showing a profile of κ₁ and κ₂ along column 39.
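
The first comment suggests that zero-crossings between values of negligible magnitude carry no information. One possible way to discard them, shown here only as a sketch, is to require a minimum amplitude on both sides of the sign change. The threshold and the function name are assumptions and are not part of the implementation described in this paper.

```python
import numpy as np

def zero_crossings(values, min_amplitude=0.0):
    """Indices i where values[i] and values[i+1] have opposite signs and
    both magnitudes exceed min_amplitude (a hypothetical noise floor)."""
    v = np.asarray(values, dtype=float)
    a, b = v[:-1], v[1:]
    opposite = a * b < 0
    strong = (np.abs(a) > min_amplitude) & (np.abs(b) > min_amplitude)
    return np.nonzero(opposite & strong)[0]

flat = 1e-7 * np.array([1.0, -1.0, 1.0, -1.0])      # numerical noise on a plane
jump = np.array([0.0, 0.3, 0.9, -0.9, -0.3, 0.0])   # response across a jump boundary

print(zero_crossings(flat))                        # every noisy pair is reported
print(zero_crossings(flat, min_amplitude=1e-5))    # noise-level crossings are dropped
print(zero_crossings(jump, min_amplitude=1e-5))    # the real crossing survives
```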

7.  FUTURE  RESEARCH 

Obviously, the material described in this paper only
represents the groundwork for a complete feature
extraction task, and serves as a validation of our ideas regarding
curvature as a powerful, local support primitive.
What needs to be done is:

- The interpretation of significant zero-crossings and
maxima as jump boundaries, edges or simply
extrema. For instance, the zero-crossing of κ₁
associated with the jump boundary of the vase
example is not accompanied by a 90° angle change,
except for σ = 2.

- The integration of the detected primitives using
various values of σ. This step will also be useful to
perform the interpretation: the maxima for a jump
boundary move away as σ increases.

- The linking of identified primitives into connected
components.

- The reconstruction, from our features, of an object
which should be reasonably close to the original.

- The abstraction of such features, leading to a
concise, natural description of the object, such as a
generalized cone.

Figure 1: Different types of surface

Figure 2: Masks to compute 1st and 2nd order differences

Figure 3: Gaussian filter

ABSTRACT

A new feature based approach to the correspondence problem is
presented. The system described is a general system which can be used
both for motion sequences and stereo pairs. It is shown how, by
grouping and grading of features, the matching process can be guided by a
plan to deal with ambiguous feature constellations in an "intelligent"
way.

INTRODUCTION: A VISION SYSTEM FOR A MOBILE ROBOT
The ASTERIX-System is planned to be part of the vision system for
a mobile robot. The robot will be equipped with a pair of stereo
cameras, acoustic sensors and a laser range finder. Hence the vision
system has to handle stereo pairs as well as motion sequences and
should be able to incorporate depth clues from other sensors.

Both  motion  and  stereo  vision  have  the  correspondence  problem  in 
common.  The  correspondence  problem  in  both  domains  differs  only  in 
the  type  of  the  constraints,  which  can  be  applied  to  solve  the  problem. 
The  main  task  however,  selecting  features  in  all  frames  of  a  motion 
sequence  or  a  stereo  pair,  is  the  same.  That  is  why  the  ASTERIX- 
system  is  based  on  a  general  correspondence  algorithm  which  can  be 
applied  to  motion  sequences  as  well  as  to  stereo  pairs. 

Specialized knowledge about disparity constraints is kept in
separate program modules (constraint sources), which evaluate e.g.
epipolar lines (stereo) [1, 6] or velocity predictions (motion) [©]. These
separate constraint sources also provide an easy way to feed constraints
derived from other kinds of sensors into the system. Thus a change
in the robot configuration affects only the constraint sources and the
rules for applying them, but not the general concept of the system.

Another advantage of this system structure is that in the case of
motion sequences the constraint sources can change according to the
knowledge acquired so far. In the first frames of an image sequence
only very general heuristic constraints can be used, but as more and
more about the scene becomes known, more sophisticated constraints can
be applied.

The matching is based on local features like corners and dots [5], L-,
Y-, X- and T-junctions [11, 12] for the following reasons:

• The extraction of these local features can be done very fast, since
most of the required computation time is used for the convolution
of the image with constant operators [5], which can be done with
an array processor nearly in real-time.

• Since the purpose of the matching is to compute the
3D-coordinates of the physical object corresponding to the matched
features, it is very convenient to match features which have unique
coordinates, like corners or dots.

• On the other hand, the features to be matched should provide
enough information to reduce search time and ambiguities
(compare with approaches which rely entirely on grayvalue correlation
[«, 7, *]).

• Small local features are less likely to be distorted by occlusion or
perspective distortions than features that extend over large areas
of the images. Hence they are easier to match than higher level
features like polygons: e.g. if only three corners of a square are
visible, you cannot match a square but you can still match three
corners of the square.

• The classification of the corners allows T-junctions, which might
be caused by occlusion and must not be used for depth
measurements, to be excluded from the stereo analysis.

The  matching  strategy  in  ASTERIX  is  the  most  important  part  of 
the  system.  ASTERIX  first  inspects  the  feature  constellations  in  both 
frames  and  plans  a  matching  sequence  which  seems  to  be  the  most 
promising  for  the  situation. 

Consider for example the image of a checker board: all corners in
the inner part of the board have look-alike neighbors. Any matching
strategy based on optimizing a similarity measure within a local
neighborhood is bound to come up with ambiguous matches or, worse, will
force an arbitrary decision and select the best match according to the
similarity measure, which might depend purely on the noise in the
image. Matching strategies based on a global optimization criterion
[2, IS], on the other hand, have the tendency in case of ambiguities to
force occasionally some wrong matches for the sake of a globally better
optimization. Ullman's "minimal mapping" will get the matches for
the checker board totally wrong if the disparity is similar to the size
of the squares.

ASTERIX uses a different strategy: the system first scans both
frames independently and groups the features into classes of similar
features, which might cause ambiguities. In the case of the checker
board ASTERIX would see immediately that there are lots of
look-alike corners in the inner parts of the board, but that the four corners
of the board itself are unique. So the system starts matching by
matching one of these unique corners first. Then the neighboring corners are
matched by following the edges that connect them, until all parts of
the image are matched or no more constraints can be derived.
Ambiguities which cannot be solved at this stage of analysis are forwarded
to the higher levels of analysis, rather than forcing a decision based
on low level clues only.

The way ambiguities are solved in ASTERIX is related to graph
matching techniques like the tracking algorithm used by Radig and
[10]. Both strategies are feature based, evaluate all possible
ambiguities and derive constraints from structural knowledge like
neighborhood relations. The difference is that graph matching strategies
compare features between the frames to be matched whereas ASTERIX
groups the features and compares them mainly within the frames.
Graph matching is very mechanical whereas the grouping generates
a plan for a data-driven matching sequence.

The planning of the matching sequence in ASTERIX can be
compared to the way a child tries to solve a jigsaw puzzle. Usually the
child will start with pieces that catch the eye because they are
strikingly different from all other pieces in either shape or color, and the
more advanced child will organize his pieces and sort them according
to color and shape.

FEATURE EXTRACTION AND GROUPING
Feature extraction is done in three steps: first an interest operator [5]
and an edge detector [12] are applied to the images and basic features
(edgels and points of interest: dots, corners) are extracted. The interest
operator labels the points of interest as dark objects on a light
background or vice versa and also computes the orientation of the corners
and a measure of the quality. The edgels are labeled with the
average grayvalues on both sides of the edge, the orientation, and also a
quality measure. All these descriptions are used to match features.

In a second step edgels are tracked to find edges, their endpoints
and intersections. Neighboring endpoints, intersections or points of
interest are combined into new features and, if possible, labeled according
to their junction type (Y, L, T, X, ...) [11].

Finally, the features found in each frame are grouped into classes
of features which are similar according to the similarity function which
is used for matching. This grouping is done by computing the minimal
spanning tree, so that the number of groups is variable.
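
The text does not say how the minimal spanning tree is turned into groups. One common reading, assumed in the sketch below, is to cut every tree edge whose dissimilarity exceeds a threshold, so that the number of groups follows from the data. The descriptors, the distance and the threshold `cut` are all hypothetical.

```python
import math

def minimal_spanning_tree(features, dist):
    """Kruskal's algorithm; returns the tree edges as (distance, i, j)."""
    n = len(features)
    edges = sorted(
        (dist(features[i], features[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((d, i, j))
    return tree

def group_features(features, dist, cut):
    """Split the tree at edges longer than `cut`; each remaining
    connected component is one class of similar features."""
    group_of = list(range(len(features)))
    for d, i, j in minimal_spanning_tree(features, dist):
        if d <= cut:
            old, new = group_of[j], group_of[i]
            group_of = [new if g == old else g for g in group_of]
    groups = {}
    for index, g in enumerate(group_of):
        groups.setdefault(g, []).append(index)
    return sorted(groups.values())

# Hypothetical 2D descriptors (for example contrast and orientation).
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(group_features(descriptors, math.dist, cut=1.0))
```

With these descriptors the result is three classes of sizes three, two and one; the singleton class corresponds to a unique feature that is a good starting point for matching.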

The similarity function is based on the weighted average of the
similarity of the basic features; to compute, for example, the similarity
of two Y-junctions, the similarities of the three edges meeting at the
junction to their corresponding edges are averaged.

GRADING  AND  MATCHING 

Since the features are grouped into equivalence classes, matching takes
place between classes of features rather than between the features themselves.
The first step of matching is grading of the classes according to
abundance and prominence of features. The matching sequence is
determined by the grades of the classes. The classes with only few but very
prominent features (high contrast) are matched first to get reliable and
unique feature matches. From these initial matches new constraints
are derived and propagated into the more ambiguous classes. The
constraint propagation follows the edges which connect the corners and
junctions.
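
The grading step can be pictured as a sort of the feature classes. The sketch below is only one way to express "few but very prominent features first": the grade tuple, the `contrast` field and the class names are assumptions made for illustration, not the grading function used in ASTERIX.

```python
from statistics import mean

def grade(feature_class):
    """Hypothetical grade: small classes first, then high mean contrast.
    Each feature is a dict with a 'contrast' entry (an assumed field)."""
    return (len(feature_class), -mean(f["contrast"] for f in feature_class))

def matching_sequence(classes):
    """Order class names so that the least ambiguous class is matched first."""
    return sorted(classes, key=lambda name: grade(classes[name]))

# Checker board example: a few unique corners and many look-alike ones.
classes = {
    "inner corners": [{"contrast": 0.9}] * 49,
    "board corners": [{"contrast": 0.9}] * 4,
    "faint dots": [{"contrast": 0.2}] * 4,
}
print(matching_sequence(classes))
```

The unique board corners come first, so the constraints derived from them can then be propagated along the connecting edges into the ambiguous class of inner corners.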

The grouping and grading of features does not necessarily reduce
the computation time for the matching. The comparison of features
within the frames takes about the same time as the comparison of
features between the frames, but the grouping process saves the results
of the comparisons in a more useful way, so that a plan can be derived
to solve ambiguous constellations.

IMPLEMENTATION  AND  RESULTS 
The edge operator and the interest operator are implemented in C;
all other parts of the system are running in LISP. The hardware
implementation of the convolution is in progress.

The feature extraction and the grouping and grading of features
have been tested on real-world scenes (machine parts, telephone) with
quite satisfactory results; the constraint system is still under
development, and the matching has been tested so far only on artificial data,
but further results are expected soon.

Abstract

This paper describes an intelligent, user-friendly geometric
modeling system which enables simple, fast building of 3D models for the
Image Understanding system. First, an instance of an object is taught
through the operations in which Generalized Cylinders are fitted to each
part of an example of the object in 3D space formed by stereo images.
Model descriptions of the instance are generated automatically after the
operations. Second, a class model of the object is learned inductively using
the descriptions of the instances, and model descriptions of the class are
also generated. The generated model descriptions can be used as input
of the ACRONYM system.

1.  Introduction 

As the research of model-based image understanding systems
advances and these systems come to be applied to the recognition of
various objects, an intelligent modeling system which can build 3D models
efficiently is becoming necessary. Namely, a modeling system which
enables simple, fast model building is required. Also, the system must
be easy to learn in terms of how to build 3D models, and user-friendly
from the viewpoint of the user interface. A system for geometric
modeling called Stereo Modeling System which satisfies these requirements
has been designed and built.

In the ACRONYM system, 3D models were built through the
text-based description language.¹ The problem was that model descriptions
must be made by measuring various parameters, such as the size of a
part of the object and the angle between parts, and by writing out long
model descriptions. For example, to build a model of a relatively simple
object like an aircraft, a couple of hundred lines of model descriptions
were required. It is not easy for a person who does not have enough
knowledge of computers to handle the modeling language. Moreover, as
the model descriptions are made by hand, errors in model descriptions
are likely to occur. In order that a modeling system may be used widely
and be applied to various objects, it is ideal that even a person who
is not a computer expert can build 3D models of objects easily and
without errors.

The 3D model has also been studied in the fields of CAD and
Computer Graphics. However, the major difference between 3D models
in these fields, especially in CAD, and in Image Understanding is that
in the latter, objects to be modeled actually exist in the world. With
this advantage, the system builds models of object instance and class
from actual examples of the object through the following two successive
stages.

1) A model of an object instance is taught interactively in 3D space
formed by stereo images of an example and a stereoscope, i.e.,
geometric features of the object are specified directly in 3D space via
the user-friendly interface, for instance voice and cursor, without
symbolic descriptions, utilizing prior knowledge of the object.

2) The system learns the model of the object class inductively through
the models of instances, and builds model descriptions of the object
class automatically.

The model building of object instance is done by fitting
Generalized Cylinders to each part of the example in 3D space, based on
comparison of the displayed Generalized Cylinder and the actual
object. The teacher divides the object into several parts according to the
required accuracy of approximation, specifies relations between parts,
and builds the model of the whole parts efficiently, making use of the
descriptions of the parts which are already defined. The system
calculates parameters such as the size of the part or the angle between parts
and generates model descriptions of the object instance. Section 2
describes the details of how to build instance models. Class models are
built by generalizing model descriptions which are expressed in terms of
algebraic constraints. Class models are also specialized by adding new
constraints obtained through misrecognition experience. The details
will be described in Section 3. Specifying geometric features of objects
directly in 3D space and teaching an instance model using an example of
the object is the most intuitive way to build 3D models. Moreover,
errors in model descriptions can be prevented by the procedure
in which the 3D model being built is displayed and always compared to the
actual example. In Section 4, the user interface is described in detail
through the actual example. The generated model descriptions can be
used as direct object models of the ACRONYM system.

2. Fitting Generalized Cylinders in 3D Space

3D models are usually built by using standard components like
cylinders, parallelepipeds, spheres, etc. 3D models can be constructed
to any degree of accuracy if the number of the faces is increased.² However,
tractability and simplicity rather than accuracy of the model is
important in image understanding for the processing stages of
prediction and matching. Generalized Cylinders³ have been widely used as
such powerful components for the 3D models of image understanding
systems, for instance the ACRONYM. The Generalized Cylinder is also
used as a primitive for building models in this system.

2.1. Generalized Cylinder Fitting Using Knots

The most intuitive way to build 3D models is to build models
directly in 3D space viewing actual 3D examples. 3D input methods have
also been studied in CAD and Computer Graphics.⁴ However, in the
case of 3D models of image understanding, actual objects usually exist.
In this system, 3D models are built efficiently in 3D space formed by
stereo images, based on actual examples. Namely, a stereo display of an
example is viewed through a stereoscope and the Generalized Cylinders
are fitted to each part of the object by specifying knots in 3D space.
The occurrence of errors in the model description can be prevented by
the continuous comparison of the displayed Generalized Cylinder with the
actual object.

As the geometric modeling of synthetic objects requires a high
degree of accuracy, many knots must be specified in order to fit models
to curves or surfaces. Although the surface can be fitted with any
accuracy by polyhedral approximation, input and modification of the model
is not easy. The fitting of curves and surfaces by Bezier or B-spline
formulations also requires many control points. However, describing
objects in terms of Generalized Cylinders for the purpose of image
understanding can be done simply by the following method.

1) From knots to a cross section

A Generalized Cylinder is defined by a cross section which sweeps
along a spine, changing shape according to a sweeping rule as it
continues to sweep. The types of Generalized Cylinders used in the
ACRONYM are shown in Table 1.


Cross-section      Sweeping rule    Spine
---------------    -------------    -----------------
circle             constant         straight
square             linear           circular
rectangle          bilinear         non-perpendicular
regular-polygon

Table 1. Types of Generalized Cylinder


A cross section lies in a plane, and a plane can be determined by three
knots as shown by an example in Figure 1. All the types of cross
section in the ACRONYM can be fitted by three knots as
illustrated in Figure 2.
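
The geometry behind the three-knot fit can be sketched as follows. Since Figure 2 is not reproduced here, the convention is an assumption: the three knots are taken to lie on the circumference of a circular cross section. The function names are hypothetical and not part of the Stereo Modeling System.

```python
import numpy as np

def plane_through(p0, p1, p2):
    """Unit normal n and offset d of the plane n . x = d through three knots."""
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    normal = np.cross(p1 - p0, p2 - p0)
    length = np.linalg.norm(normal)
    if length == 0.0:
        raise ValueError("the three knots are collinear")
    normal = normal / length
    return normal, float(normal @ p0)

def circle_through(p0, p1, p2):
    """Center and radius of the circle through three knots in 3D.
    Assumption: all three knots lie on the circumference."""
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    a, b = p0 - p2, p1 - p2
    axb = np.cross(a, b)
    denom = 2.0 * (axb @ axb)
    if denom == 0.0:
        raise ValueError("the three knots are collinear")
    center = p2 + np.cross((a @ a) * b - (b @ b) * a, axb) / denom
    return center, float(np.linalg.norm(p0 - center))

knots = [(1.0, 0.0, 2.0), (0.0, 1.0, 2.0), (-1.0, 0.0, 2.0)]
normal, offset = plane_through(*knots)
center, radius = circle_through(*knots)
print(normal, offset)    # plane of the cross section
print(center, radius)    # circular cross section lying in that plane
```

For these knots the plane is z = 2 and the circle has center (0, 0, 2) and radius 1, so three points are enough to fix the position, orientation and size of a circular cross section.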

2) From cross sections to a Generalized Cylinder

As described, a Generalized Cylinder is formed by a cross section
sweeping along a spine. All the types of Generalized Cylinder in
the ACRONYM can be fitted by three cross sections at most. For
example, three cross sections are required for the types of
Generalized Cylinder which have a circular spine as shown in Figure 3.
Two cross sections are required for the types of Generalized
Cylinder which have a straight or non-perpendicular spine and have a
sweeping rule other than the constant type. Consequently, at most nine
knots (three per cross section) are required to fit three cross sections.
However, the most common type of Generalized Cylinder, which has a
straight spine and a constant sweeping rule, can be specified efficiently
using only four knots as illustrated in Figure 4. Due to the smaller number
of required knots compared to the usual surface fitting and the local
controllability in the stages of cross section fitting, Generalized
Cylinder fitting can be done simply and efficiently.


Figure 1: An example of fitting a cross section by three knots.

2.2. Using Pre-defined Parts Efficiently in Model Building

Some 3D objects have a hierarchical structure and have the same
subparts. Once a part is described in terms of a Generalized Cylinder,
the description can be used to define other parts efficiently. Besides,
this has the advantage that 3D models can be described in compact
form and modifications can be done to all the common parts
simultaneously.

1) Identical part

Repeating the same Generalized Cylinder definition can be avoided
by selecting an already defined part name. Also, if the
relative position and the orientation of the part are identical to
those of already defined parts, specifications of the relation described
later are not necessary either. Otherwise, three knots are necessary
for the specifications of the position and the rotation of its local
coordinates system. Iterative definition of the identical parts
enables the simultaneous definition of several identical parts. After
the Generalized Cylinder definition of the first identical part and
specification of the number of iterative parts, two or three knots
are specified to determine the positions of the rest of the identical
parts. Namely, if the positions of the identical parts are circular,
three knots are required. If the position is linear, two knots are
enough to determine the positions.


Figure 2: Cross section types.


Figure 3: Fitting a Generalized Cylinder from knots.

2)  Symmetrical  part 

Some objects have symmetrical subparts. A symmetrical part
can also be described in terms of model descriptions already
defined. For example, the port wing of a jet aircraft is
symmetrical to the starboard wing. In this case, the specification
of the port wing can be eliminated by utilizing the definition of the
starboard wing, which is symmetrical about the y-z plane of the
fuselage. The model description of the port wing is generated from
the description of the starboard wing.
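The reuse of pre-defined parts can be sketched in a few lines. The example below places identical parts along a line from two knots and mirrors a part's points about the y-z plane. It assumes that the two knots mark the first and last positions and that positions are evenly spaced, neither of which is stated in the text; the function names are hypothetical.

```python
def replicate_linear(first, last, count):
    """Evenly spaced positions from `first` to `last` inclusive.

    Assumption: the two knots give the positions of the first and last
    identical part; the intermediate parts are spaced uniformly.
    """
    if count < 2:
        raise ValueError("need at least two parts to replicate along a line")
    return [
        tuple(f + (l - f) * i / (count - 1) for f, l in zip(first, last))
        for i in range(count)
    ]

def mirror_yz(points):
    """Reflect 3D points about the y-z plane (x -> -x)."""
    return [(-x, y, z) for (x, y, z) in points]

# Hypothetical usage: three identical parts along the x axis, and a
# starboard-wing knot mirrored to obtain the port-wing knot.
positions = replicate_linear((0.0, 0.0, 0.0), (4.0, 0.0, 0.0), 3)
port_knots = mirror_yz([(10.0, 2.0, 0.5)])
```

A reflection also reverses the handedness of any local coordinate frame attached to the part, so a fuller implementation would have to mirror the frame consistently as well.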


Figure  4:  Fitting  a  Generalised  Cylinder  by  four  knots. 


2.3. Spatial Relations between Parts

The final geometric relation between the parts of an object is
described in terms of the relative position and the relative rotation of
the local coordinate system of a part with respect to another part. The
relations between parts described below can be specified through menu
selection. At least the affix relation to another part must be specified.
The system automatically calculates the necessary parameters for the
transformation of coordinate systems, based on the relative location of
the part in 3D space, and generates relation descriptions. Also, several
relations can be specified for the accurate definition of the geometric
relation. For instance, relations like Align, Coplanar, and Flush can be
specified with respect to other parts which are already defined. If these
kinds of relations are specified, the local coordinate system of the
Generalised Cylinder is slightly modified and accurate model
descriptions are generated.

Also, the subpart relation must be specified, i.e., the name of the
subpart must be defined when the subpart is specified. The system
generates a subpart tree using the names. If the object example is not
the first example of the class, the name can be selected from the menu.

3.  Forming  Object  Class  from  Examples 

After the model descriptions of the object instance are generated
as described in Section 2, the model of the object class is learned
inductively, based on the examples or models of the object instances.

3.1.  Representation  for  Learning 

In the ACRONYM system, data structures are represented by a
frame-like structure, i.e., each data object is an instance of a unit. Units
have a set of associated slots whose fillers define their values.1 Objects
are represented by an object graph whose arcs are subpart and affixment.
The subpart arcs describe a coarse-to-fine structural hierarchy
represented by a subpart tree. Affixment arcs relate coordinate systems of
objects. A class is represented through a mechanism of constraints and a
restriction graph. A constraint is an inequality on algebraic expressions
which defines the set of values that can be taken by the algebraic
expression in a slot, e.g., 5.0 < width < 6.0. The restriction graph is used
to organise the constraints into class, subclass and instance. Namely,
a set of constraints can define an object class, subclass or instance, and
the hierarchy of the sets of constraints is defined by the restriction
graph. Figure 5 shows an example of the class hierarchy of the jet
aircraft. The nodes in each level have corresponding constraints, though
only one set of subpart tree, affixment tree, and cylinder descriptions
exists for the object class. The constraints of a predecessor are applied
to its successor nodes; for instance, the constraints of B-747 are applied
to both B-747B and B-747SP.

3.2.  Rules  of  Generalisation 

Generalisation to form a class model can be done simply owing
to the representation described above. The model descriptions of the
object instance built by the system as described in Section 2 are written
in terms of constraints (“=” is one of the forms of the constraints).
The constraints can be used for variations in size, in structure, and
in spatial relationships. For example, a structure can be expressed like
LEG-QUANTITY = 3. Generalisation of the model descriptions can be
done by generalising these constraints. For example, suppose B-747B
has the constraint

FUSELAGE-LENGTH = 67.3

and B-747SP has the constraint

FUSELAGE-LENGTH = 52.0

These constraints are generalised and the following constraint on the
FUSELAGE-LENGTH of the class B-747 is obtained:

52.0 - ε < FUSELAGE-LENGTH < 67.3 + ε

where ε is a tolerance. Constraints for structure can be generalised
in the same manner. If the corresponding subpart cannot be found
in a certain instance, the quantity of the subpart in that instance is
regarded as 0.


Figure 5: An example of the class hierarchy (class Generic Jet Aircraft, with its subpart tree, affixment tree, cylinder descriptions, and constraints).
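The generalisation rule above can be sketched as a small function that widens equality constraints into open intervals. The two fuselage lengths come from the text. The tolerance value, the ENGINE-QUANTITY and WINGLET-QUANTITY slots, the convention that quantity slots end in "-QUANTITY", and the function names are assumptions made only for this illustration.

```python
def generalise(instances, tolerance):
    """Generalise per-instance '=' constraints into (low, high) open intervals.

    instances: list of dicts mapping slot name -> numeric value.
    A quantity slot missing from an instance is regarded as 0, as in the
    text; other missing slots are simply skipped for that instance.
    """
    slots = set()
    for inst in instances:
        slots.update(inst)
    bounds = {}
    for slot in sorted(slots):
        values = []
        for inst in instances:
            if slot in inst:
                values.append(inst[slot])
            elif slot.endswith("-QUANTITY"):   # assumed naming convention
                values.append(0)
        bounds[slot] = (min(values) - tolerance, max(values) + tolerance)
    return bounds

def satisfies(bounds, instance):
    """True if every constrained slot present in `instance` lies strictly inside its interval."""
    return all(low < instance[slot] < high
               for slot, (low, high) in bounds.items() if slot in instance)

# Fuselage lengths are from the text; everything else is hypothetical.
b747b = {"FUSELAGE-LENGTH": 67.3, "ENGINE-QUANTITY": 4, "WINGLET-QUANTITY": 2}
b747sp = {"FUSELAGE-LENGTH": 52.0, "ENGINE-QUANTITY": 4}
b747_class = generalise([b747b, b747sp], tolerance=0.5)
```

With these inputs the FUSELAGE-LENGTH interval spans both instances plus the tolerance, and WINGLET-QUANTITY ranges from below 0 to above 2 because the subpart is absent from one instance.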

In summary, the teacher teaches the object instance using an
example and specifies the name of the class/subclass to which it belongs.
The system generalises the model descriptions of the object class as
described. Namely, all the constraints in the predecessor nodes in the
restriction graph are generalised. For example, if we add a new
instance to the class B-747, say B-747X, the constraints in the nodes ‘B-747’
and ‘generic aircraft’ are generalised.
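The propagation through the restriction graph might look like the following sketch, in which adding an instance widens the intervals of its class node and of every predecessor. The `Node` class, the tolerance, and the B-747X fuselage length are hypothetical; only the node names and the first two lengths are taken from the text.

```python
class Node:
    """A node of a (tree-shaped) restriction graph with interval constraints."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.bounds = {}          # slot name -> (low, high)

    def absorb(self, slot, value, tolerance):
        """Widen this node's interval for `slot` so that it admits `value`."""
        low, high = self.bounds.get(slot, (value - tolerance, value + tolerance))
        self.bounds[slot] = (min(low, value - tolerance),
                             max(high, value + tolerance))

def add_instance(class_node, constraints, tolerance):
    """Generalise the class node and all of its predecessors."""
    node = class_node
    while node is not None:
        for slot, value in constraints.items():
            node.absorb(slot, value, tolerance)
        node = node.parent

generic = Node("generic aircraft")
b747 = Node("B-747", parent=generic)
add_instance(b747, {"FUSELAGE-LENGTH": 67.3}, tolerance=1.0)   # B-747B
add_instance(b747, {"FUSELAGE-LENGTH": 52.0}, tolerance=1.0)   # B-747SP
add_instance(b747, {"FUSELAGE-LENGTH": 70.0}, tolerance=1.0)   # hypothetical B-747X
```

A real restriction graph may be a more general graph than the single-parent chain assumed here.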

The inductive learning described above uses only positive
examples, and the model descriptions sometimes become overgeneralised.
The class model should then be specialised using negative examples
or counterexamples. In this system, descriptions of the
class model can be specialised by the addition of new constraints.
When the ACRONYM system misrecognises an object which is not
included in the class, the example can be used as a negative example or
counterexample. Using this misrecognised example, the teacher can add a
new type of constraint to the model description.

Learning from examples can be categorised into two types:
one-trial and incremental. “Although this incremental method
parallels human learning, it is apt to lead one down garden paths by
an injudicious choice of initial examples in formulating the kernel of
the new concept”.7 Therefore, in this system, the teacher must select
good examples for forming good class descriptions. Finally, there are
plans to include some type of constructive induction in the system
in order to increase its power.

4.  User  Interface  and  Examples 

The user-friendly interface for building models is described using
actual examples in this section. Namely, the method of inputting knots in
3D space and fitting the Generalised Cylinder is described, based on the
Generalised Cylinder fitting method described in Section 2.

4.1.  User-friendly  Interface 

Figure 6 shows the following hardware devices which are used in the
user interface: (1) Scanning Stereoscope ODSS III, (2) display device
for stereo images, (3) Voice Recogniser SYS300, (4) trackball, (5) TSS
terminal. The system generates 3D models through the following
user-friendly interface using these hardware devices.

1)  Convenient Input

The system is designed to use only voice and the pointing device.
The keyboard is not usually used except where names are specified.
The voice input is especially useful in Generalised Cylinder fitting
operations, in which both eyes are fixed on an object and a hand is
placed on the trackball, as described later.


Figure 6: Hardware of Stereo Modeling System

2)  Menu  Selection 

Commands are given to the system by menu selection
using voice. With this menu method, users need not memorise
commands and the operations can be learned easily, guided by the
system.

Also, short help messages explaining what to do next are displayed
in the lower part of the screen whenever necessary. If the user is
completely lost as to what to do next, “help” in the menu can be selected.
The system always provides help commands in menus and the user can
read the displayed help messages.

4.2.  3D  Input 

The knot input is done directly in the 3D space formed by stereo
images. The position of the knot in 3D space can be calculated from the
corresponding points in two different images based on the theory of
photogrammetry.5 From the viewpoint of speed and accuracy, 3D
input is done through the following two stages.

1)  Coarse  initial  positioning  with  two  cursors 

The coarse initial position of the knot is specified by pointing to
corresponding points in the stereo images, using two cursors
displayed on the same epipolar lines in the stereo pair. Although
3D positioning using the cursors moved by the trackball is fast, it
is impossible to specify a position between pixels. A one-pixel
disparity error between the corresponding points in the stereo images
sometimes affects the accuracy of the depth measurement
significantly. Therefore, in order to position the knot in 3D space with
a high degree of accuracy, the following 3D pointer, which can be
moved in 3D space with any degree of accuracy, is used for fine
positioning of the knot.

2)  Fine positioning with 3D pointer

The accurate 3D position of the knot can be specified through the
3D pointer displayed in 3D space. The 3D pointer can be moved
in six directions in 3D space with voice commands, as illustrated
in Figure 7. Also, the speed of the 3D pointer can be changed,
i.e., the speed can be changed to x2 and x1/2 by the “fast” and “slow”
commands respectively. For instance, if you say “slow” three times,
the speed is reduced to (1/2)^3 = 1/8. The 3D pointer can be
moved in 3D space after all the initial input of the knots necessary for
generating a cross section or a Generalised Cylinder is finished.
Although it is difficult to display the 3D pointer between pixels
in the image, the 3D position of the knot can be determined with
great accuracy by observing a displayed cross section reflecting
the movement of the knot. For example, the radius of a circular cross
section is sometimes highly sensitive to the slightest movement of
one knot.
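To see why a one-pixel disparity error matters, and how the voice-controlled speed scaling works, consider the sketch below. It assumes a rectified, parallel-axis stereo pair in which depth equals focal length times baseline divided by disparity. The text only refers to the theory of photogrammetry, so this camera model, the numeric values, and the function names are all assumptions for illustration.

```python
def depth_from_disparity(focal_px, baseline, disparity_px):
    """Depth for a rectified parallel stereo pair: Z = f * B / d (assumed model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline / disparity_px

def depth_change_for_one_pixel(focal_px, baseline, disparity_px):
    """How much the computed depth shifts if the disparity is off by one pixel."""
    return (depth_from_disparity(focal_px, baseline, disparity_px)
            - depth_from_disparity(focal_px, baseline, disparity_px + 1))

def pointer_speed(commands, base_speed=1.0):
    """Apply 'fast' (x2) and 'slow' (x1/2) voice commands to the pointer speed."""
    speed = base_speed
    for command in commands:
        if command == "fast":
            speed *= 2.0
        elif command == "slow":
            speed *= 0.5
        else:
            raise ValueError("unknown command: " + command)
    return speed

# Hypothetical numbers: 1000-pixel focal length, 0.1 m baseline, 50-pixel disparity.
z = depth_from_disparity(1000.0, 0.1, 50.0)
dz = depth_change_for_one_pixel(1000.0, 0.1, 50.0)
slow_three_times = pointer_speed(["slow", "slow", "slow"])   # 1/8 of the base speed
```

Under these assumed values a single pixel of disparity error moves the estimated depth by about two percent, which is the kind of error the fine 3D pointer is meant to remove.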

4.3.  Stereo  Images 

Stereo images of objects are obtained by placing objects in front
of a stereo camera. Currently, the online stereo camera is not used and
stereo images which are stored in a disk file are used by the system. As
described before, a one-pixel difference is sometimes very important for
accurate Generalised Cylinder fitting in 3D space, so a high-resolution
display device is needed. To solve this problem, we built a zoom function,
i.e., a part to be modeled can be enlarged and displayed whenever
accurate modeling is required. This zoom function enables accurate
modeling, as if a high-resolution display device were available. It is
also better that the stereo images themselves be high-resolution.
For this purpose, in addition to the normal size picture, high-resolution
stereo images are also used by the system. Any portion of the stereo
images can be displayed at any resolution with this zoom command, using
the resolution pyramid of the stereo images.
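A resolution pyramid with a zoom window could be organised as in the following sketch. The text does not say how the pyramid is built, so the 2x2 block averaging, the list-of-rows image representation, and the function names are assumptions chosen for simplicity.

```python
def downsample(image):
    """Halve the resolution by averaging 2x2 blocks (image: list of equal-length rows)."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
          + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
         for c in range(cols)]
        for r in range(rows)
    ]

def build_pyramid(image, levels):
    """Level 0 is the full-resolution image; each further level is half the size."""
    pyramid = [image]
    for _ in range(levels - 1):
        top = pyramid[-1]
        if len(top) < 2 or len(top[0]) < 2:
            break
        pyramid.append(downsample(top))
    return pyramid

def zoom_window(pyramid, level, top, left, height, width):
    """Crop a window given in full-resolution coordinates at the chosen level."""
    s = 2 ** level
    image = pyramid[level]
    return [row[left // s:(left + width) // s]
            for row in image[top // s:(top + height) // s]]

# Hypothetical 4x4 image whose pixel value is its row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
pyramid = build_pyramid(image, levels=2)
detail = zoom_window(pyramid, 0, 2, 2, 2, 2)    # full-resolution lower-right corner
```

Selecting a coarse level for overview and level 0 for the zoom window gives the effect of a higher-resolution display on a fixed-size screen.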

4.4.  Examples 

The user interface described above will be shown with actual
examples of Generalised Cylinder fitting operations. Figure 8 shows a
displayed original stereo pair of industrial parts with a command menu.
Suppose we build a model of one of the parts. After the zoom
command is selected and the location of the zoom window is specified with
the trackball, the enlarged stereo pair of the selected part is displayed
as shown in Figure 9. With two cross-haired cursors, the initial 3D
position of a knot is specified as shown in Figure 10 after the type of
Generalised Cylinder is selected from the menu. In this figure, “v”
shows the position of already specified knots. After the specification
of three knots, the knots can be moved by selecting one knot with the
cursor pointing to the vicinity of the knot. The 3D pointer “!” appears
as shown in Figure 11 and the 3D pointer can be moved in 3D space
with voice commands. The displayed cross section shown in Figure 11
is also moved/transformed, reflecting the movement of the 3D pointer.
After the fitting operation of the cross section is finished, the last knot,
in this example, is specified. When all the knot input is done, the
Generalised Cylinder is displayed and the knots can also be moved at this
stage. Figure 12 shows the displayed Generalised Cylinder in terms of
two cross sections. As this system does not use any special graphics
hardware and since it is written in a Lisp called Successor Lisp (also some
graphics packages are written in C), it takes too long for real-time
interaction if moving/transforming Generalised Cylinders are displayed
with hidden-line removal. Therefore only cross sections are displayed
instead of a Generalised Cylinder while the Generalised Cylinder is
being fitted. Besides, hidden-line removal is not necessary as long as
the stereo line drawing is used, as there is no ambiguity of 3D structure
in the stereo line drawing. Figure 13 shows Generalised Cylinder
descriptions generated by the system after all the fitting operations were
finished.


Figure 7: 3D pointer controlled by voice commands

Although it might be accurate enough for most purposes (the
actual size of the part is 31mm x 33mm x 16mm), the Generalised Cylinder
descriptions would have been more accurate if we had measured camera
parameters more accurately.

It takes about three or four minutes to fit one Generalised
Cylinder with four knots, though the problems of display time and 3D input
accuracy had to be solved. The total time required to fit Generalised
Cylinders to all the subparts of an object depends on how many
subparts it has. For example, if the object is formed by two or three
subparts, ten minutes would be enough for building the model of the
object. With this user-friendly interface, even a person who is not a
computer expert can learn how to build models easily and can build
them simply and quickly without errors.

5.  Conclusions 

The Stereo Modeling System, which enables simple, fast model
building, was described. The key ideas here are the following: (1) build
a model of an object instance from a 3D example of the object; (2)
learn the class model of the object from examples. This system is easy
to learn, i.e., even a person who is not a computer expert can master
it in an hour or so, and models of 3D objects can be built easily and
quickly, e.g., a model of a simple object with two or three parts can be
built in 10 minutes without errors.

Although the problem of the accuracy of the modeling was solved
by the zoom function and the 3D pointer, which can be moved in 3D
space within one pixel of the stereo images, this modeling system might
not be suited for the highly accurate modeling of complicated objects.
However, simplicity rather than high accuracy of 3D model
descriptions is important in Image Understanding. Even we do not have
accurate 3D models of objects in the world in our brains. Although the
3D models which this system produces might not be suited for CAD or
Computer Graphics, in which a high degree of accuracy is important,
they are believed to be accurate enough for most applications in image
understanding.

It would be ideal if 3D models were built automatically without
any aid. However, it is still difficult to build useful 3D models for Image
Understanding fully automatically. In this system, the example of the
object is divided into parts according to the required accuracy, based
on human judgement, and geometric features of the object are taught
and described efficiently using the knowledge of the object we have.
Models can be built from one stereo pair, for we know that there is no
tail in the hidden part of the object. However, a fully automatic system
without such knowledge requires pictures of the object from all different
angles. Nevertheless, it is true that in our system a stereo pair from
a good viewpoint is needed and several stereo pairs might be required
for a complicated object. It is possible to continue building the model
from different angles if camera parameters are known, and this may
be necessary in future. However, the key idea here is to build models
directly in 3D space using actual examples, i.e., the 3D space need not be
formed by stereo images and a stereoscope. It is also possible to fit
Generalised Cylinders using knots in actual 3D space via a 3D pointer like
a robot arm. This type of alternative method might be useful if a good 3D
pointer is available.

Some classes of objects have various geometric instances and they
are best described by functional features and higher geometric features.*
On the other hand, there is another class of objects which can best be
described by geometric features and is difficult to describe by functional
features, e.g., there is no man with 10 legs, and what are the functional
features of a man? Also there is a class of objects which can be
described both functionally and geometrically, e.g., tennis balls are
always round, notebooks are rectangular, and they have functions. As
the first step towards a powerful modeling system which can handle
both functional and geometric features, the geometric modeling system
described here is designed to build models of objects which can be
described by geometric features. This system and the ACRONYM
system do not yet handle functional and higher geometric features.
They would become more powerful if, in future, they could handle
these features.
Figure 8: Stereo pair of industrial parts (original size).


Figure 9: Enlarged stereo pair.


Figure 10: Initial knot input with two cursors.


Figure 11: Fitting the cross section with the 3D pointer.


Figure 12: Fitting the Generalised Cylinder.


Figure 13: Generated part descriptions.
Abstract

The Intelligent Task Automation Project involves the
automatic location and assembly of a tray of parts in an
uncontrolled environment (i.e., no control of lighting). This
paper provides a brief description of the ACRONYM vision
system and concentrates on extensions for location of the
assembly components. The characteristics of the components
are significantly different from the domains in which ACRONYM
has been demonstrated; in particular, the parts include holes and
springs. Problems in representation, image feature prediction
and image interpretation are discussed along with significance
for future work in model based vision.


Introduction 

The ITA Project

The Intelligent Task Automation (ITA) Project1 includes the
automatic assembly of a push button micro switch in an
unstructured environment. That is, no special lighting or part
feeders are used. The ITA project is nearing the completion of
the first of two phases: phase one involves technology
development and feasibility analysis; phase two is an
implementation which achieves goals of speed and performance.
The system is planned to operate as follows:

•  Off  line  strategy  selection 

o  Analyze  a  three  dimensional  model  of  each 
component  of  the  micro  switch  and  predict  the 
shapes  and  relations  between  shapes  which 
will  be  seen  in  a  grey  scale  image. 

o  From  the  predicted  features  and  relationships 
for  each  part,  select  a  set  of  particularly 
important  characteristics. 

o  Select algorithms to locate the expected
features, and to match the relations.

•  On  line  execution 

o  Receive  a  tray  of  parts  (assume  the  tray 
contains  all  components  for  one  micro¬ 
switch). 

o  Shake  the  tray  to  eliminate  unstable  part 
poses. 

o  Locate  the  first  (next)  part  for  the  assembly  in 
order.  Send  the  part  position  and  orientation 
to  a  robot  arm  controller  for  part  acquisition 
and  assembly. 


Figure 1: Camera and Tray (grey-scale camera and tray of parts; the switch assembly parts)


Two types of sensors are available for part location: a grey-scale
CCD camera is mounted above the tray and there are two
range sensors. One is a sparse sensor mounted on an arm, and
the other is a dense range sensor mounted overhead, both
using a plane of laser light in a triangulation process.

The  ACRONYM  vision  system 

The ACRONYM vision system2,3 is a feasibility demonstration
of automatic feature prediction and image interpretation.
ACRONYM was chosen for its ability to "reason" about three
dimensional geometric models, and from this analysis, to
automatically predict image features and relations. Thus, no
training is required; only the geometric models are given to
ACRONYM. In this paper, we will describe ACRONYM's
capability to reason about a new class of objects (the micro-switch
parts) and to recognize these parts from grey scale image
data.
ACRONYM was initially used for location of aircraft in aerial
photographs. It was also used to model short range views of
some simple electric motors. Hughes Research Laboratory then
used ACRONYM to locate ships from various camera angles.
The ITA switch parts represent a significantly different set of
geometric features than these initial domains.


The  base  ACRONYM  vision  system  reasons  about  models  of 
the  objects  which  may  appear  in  an  image  and  is  domain- 
independent  (i.e.,  the  reasoning  is  based  on  geometry  rather 
than  on  domain  specific  knowledge).  The  system  can  be  divided 
into  four  modules  -  modeling,  prediction,  image  description  and 
image  interpretation.  The  user  models  the  objects  in  the  domain 
and  specifies  the  relationships  between  the  objects  and  in  the 
world.  Base  ACRONYM  analyzes  the  object  and  world  models  to 
predict  image  features  and  relationships.  When  given  an  image, 
the  interpretation  module  uses  these  predictions  to  identify 
objects  in  the  world  image,  inferring  three  dimensional 
information  concerning  the  shape,  structure,  location  and 
orientation  of  the  objects. 

Modeling  System 

Objects  An  object  model  describes  the  three  dimensional 
shape  and  structure  of  an  object.  The  object  is  divided  into 
simple  subparts,  each  described  by  a  generalized  cone  volume 
primitive.  (A  generalized  cone  describes  a  volume  by  sweeping  a
planar  cross  section,  for  example  a  circle,  along  a  spine4.)  Base
ACRONYM  implements  a  subset  of  the  possible  volumes 
represented  by  generalized  cones.  The  spine  must  be  straight 
and  cross-sections  may  be  one  of  a  circle,  square,  or  rectangle. 
The  cross  section  may  be  deformed  linearly  in  one  or  both 
dimensions  (i.e.,  the  dimensions  of  the  cross-section)  and  may 
be  held  at  an  arbitrary  fixed  angle  to  the  spine.  For  example,  a 
cube  is  a  square  swept  along  a  straight  spine.  This 
representation  is  not  unique;  each  of  the  three  axes  of  symmetry 
may  be  used  for  the  spine  in  this  example. 

In  addition  to  individual  objects,  base  ACRONYM  can 
represent  generic  classes  of  objects.  This  is  accomplished  by
using  symbolic,  rather  than  numeric,  parameters  for  the  volume 
descriptions.  In  the  initial  ACRONYM  demonstrations,  747  and 
L1011  aircraft  were  described  by  the  same  generic  model 
containing  symbolic  parameters.  Each  aircraft  type  had  a  set  of 
numeric  constraints  on  these  symbolic  variables. 

The  structure  of  an  object  is  described  by  the  relative 
positions  and  orientations  of  its  subparts.  The  volume  of  each 
subpart  is  described  by  a  generalized  cone  and  each  generalized 
cone  has  a  local  coordinate  system.  The  object  also  has  a 
coordinate  system.  The  position  and  orientation  of  a  subpart 
(generalized  cone)  relative  to  the  object  is  specified  by  the 
transformation  between  their  coordinate  systems. 
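The chain of coordinate systems (subpart to object, object to world) can be sketched as composed rigid transformations. The representation below, a pose as a pair of a 3x3 rotation and a translation with rotation only about the z axis, is an assumption for this example and is not ACRONYM's internal representation; the function names are hypothetical.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))

def apply(pose, point):
    """Map a point from the pose's local frame into its parent frame."""
    rotation, translation = pose
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def compose(outer, inner):
    """Single pose equivalent to applying `inner` first and then `outer`."""
    r_outer, _ = outer
    r_inner, t_inner = inner
    rotation = tuple(
        tuple(sum(r_outer[i][k] * r_inner[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )
    return (rotation, apply(outer, t_inner))

# Hypothetical poses: a subpart rotated 90 degrees and offset within its
# object, and the object placed 5 units up in the world.
subpart_in_object = (rot_z(math.pi / 2), (1.0, 0.0, 0.0))
object_in_world = (rot_z(0.0), (0.0, 0.0, 5.0))
subpart_in_world = compose(object_in_world, subpart_in_object)
tip_in_world = apply(subpart_in_world, (1.0, 0.0, 0.0))
```

Leaving some of these pose parameters symbolic rather than numeric is what lets a model stand for a range of possible placements.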

Scene  An  object  which  is  expected  to  be  seen  in  an  image  is 
placed  into  the  world  model  by  specifying  the  transformation 
between  the  world  and  object  coordinate  systems.  Usually  some 
or  all  of  the  transformation  parameters  are  symbolic  -  if  the 
position  and  orientation  of  an  object  are  known  in  advance,  then 
a  vision  system  is  not  needed  to  locate  the  object. 

Camera  In  addition  to  knowing  which  objects  may  be  in  the 
world  image,  base  ACRONYM  uses  knowledge  about  the  camera 
and  the  camera  location  relative  to  the  world  coordinate  system 
to  guide  its  prediction  and  interpretation.  In  the  ITA  task  the 
camera  location  is  precisely  known  but  ACRONYM  can  reason 
about the appearance of objects without precise information.
For the aerial photo demonstrations, the focal ratio of the camera
was known exactly but the altitude and camera orientation were
underspecified. The altitude was constrained to be between
1,000 and 12,000 meters, and the camera was known to be
pointing generally at the ground.

Predictions 

Given  a  set  of  object  models  and  a  scene  and  camera  model, 
geometric  reasoning  rules  are  used  to  predict  shapes  and 
relationships  that  will  be  observable  in  an  image.  Features  which 
are  always  observable  (observable  within  the  entire  modeled 
range  of  camera  viewpoints)  are  called  invariant.  Unfortunately, 
few  features,  in  general,  will  be  visible  over  the  entire  range  of 
camera  and  object  position/orientation.  Usually  there  are 
accidental  viewpoints  in  the  range  of  camera  and  object
locations.  For  example,  the  sides  of  a  cylinder  will  produce
parallel  lines  in  an  image  for  all  viewpoints  except  when  the 
viewpoint  is  collinear  with  the  axis  of  the  cylinder  (not  invariant). 
However,  if  the  cylinder  is  known  to  lie  on  its  side,  and  the 
camera  position  is  known  to  be  above  the  plane  of  the  cylinder, 
then  the  parallel  lines  are  invariant. 

ACRONYM also analyzes the range of camera/object
geometry to predict features and relations which are not
invariantly observable but which change slowly with changes in
viewpoint, called quasi-invariants. If no knowledge were available in the
previous example, the length of the cylinder would still be
quasi-invariant because it changes slowly with change in viewpoint.

Predictions  are  described  by  the  prediction  graph.  The
nodes  of  the  graph  represent  the  predictions  of  image  shape 
features,  and  the  arcs  of  the  graph  specify  relations  between  the 
features.  As  will  be  seen,  relationships  between  shapes  are 
critical  to  ACRONYM. 

Shape  Prediction  ACRONYM  predicts  image  shapes  for  the 
"swept"  sides  and  for  the  end  faces  of  a  generalized  cone.  A 
trapezoid  is  a  two  dimensional  projection  of  the  "swept”  sides  of 
a  generalized  cone  with  a  straight  spine.  The  projection  of  an 
end  face  can  also  be  described  by  a  trapezoid  for  square  and 
rectangular cross-sections. Circular cross sections produce
ellipses in an image. The size of the shapes is constrained by
upper and lower bounds based on the modeled range of
camera-to-object geometry.
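The way a modeled range of viewing geometry turns into upper and lower bounds on image size can be illustrated with a simple pinhole model. The sketch assumes a camera looking straight down at an object lying parallel to the image plane, so that image length equals focal length times object length divided by distance. ACRONYM's actual bounds come from its constraint reasoning over symbolic expressions, which is not reproduced here. The altitude range is the one quoted for the aerial photo demonstrations; the focal length and object length are hypothetical.

```python
def image_length_bounds(object_length, focal_length, min_distance, max_distance):
    """Bounds on projected length for a fronto-parallel object (pinhole model).

    Returns (lower, upper) in the same units as `focal_length`.
    """
    if not 0 < min_distance <= max_distance:
        raise ValueError("distances must satisfy 0 < min <= max")
    lower = focal_length * object_length / max_distance   # farthest view
    upper = focal_length * object_length / min_distance   # closest view
    return lower, upper

def within_bounds(measured_length, bounds):
    """True if a measured image feature is compatible with the prediction."""
    lower, upper = bounds
    return lower <= measured_length <= upper

# Altitude range from the text; 0.15 m focal length and 60 m object are hypothetical.
bounds = image_length_bounds(60.0, 0.15, 1000.0, 12000.0)
```

Foreshortening from oblique views would lower the lower bound further, so a fuller treatment would also range over camera orientation.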

Relationship  Prediction  In  addition  to  shapes,  relations  (arcs 
of  the  prediction  graph)  between  shapes  are  also  predicted.
During  interpretation,  pairs  of  hypothesized  matches  of  image 
features  to  prediction  nodes  (shapes)  are  checked  for 
consistency  by  attempting  to  find  the  predicted  relationship 
between  them  (which  is  represented  by  a  prediction  graph  arc). 
Prediction  arcs  are  generated  to  relate  multiple  shapes  predicted 
for  a  single  cone  as  well  as  between  different  generalized  cones 
(subparts)  for  an  object.  The  latter  are  actually  more  important  in 
arriving  at  a  consistent  global  interpretation  of  collections  of 
image  features  as  complex  objects. 

Exclusive  arcs  relate  image  features  which  are  mutually 
exclusive.  For  example,  the  two  end  faces  of  a  solid  cylinder 
cannot  be  visible  at  the  same  time.  Collinear  and  connected  arcs 
represent  invariant  relationships.  If  two  straight  lines  are 
collinear  in  three  space,  then  their  images  must  be  collinear, 
except  in  the  rare  case  of  a  single  degenerate  point.  If  two  cones 
are  physically  connected  in  three  space,  they  are  connected  in 
any  image.  Angle  arcs  are  predicted  between  the  spines  of  two 
generalized  cones.  For  those  image  features  which  are  not 
connected,  distance  arcs  are  generated  to  provide  additional 
constraints.  Finally,  contained  arcs  relate  two  predicted  shapes, 
one  of  which  contains  the  other  in  the  image. 
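
The prediction graph described above can be pictured as a small labeled graph. The following Python sketch is illustrative only: the class names, field names and numeric bounds are assumptions made for the example, not ACRONYM's actual data structures. Arcs are stored as undirected triples, which ignores the direction of a contained arc.

```python
from dataclasses import dataclass, field

# Arc kinds named in the text; the string spellings are our own choice.
ARC_KINDS = {"exclusive", "collinear", "connected", "angle", "distance", "contained"}


@dataclass
class ShapeNode:
    name: str        # hypothetical label, e.g. "fuselage"
    shape: str       # "trapezoid" or "ellipse"
    size_lo: float   # lower bound on predicted size (arbitrary units)
    size_hi: float   # upper bound on predicted size


@dataclass
class PredictionGraph:
    nodes: dict = field(default_factory=dict)   # name -> ShapeNode
    arcs: list = field(default_factory=list)    # (kind, name_a, name_b)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_arc(self, kind, a, b):
        if kind not in ARC_KINDS:
            raise ValueError(f"unknown arc kind: {kind}")
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.arcs.append((kind, a, b))

    def arcs_between(self, a, b):
        # Arcs are treated as undirected in this sketch.
        return [k for (k, x, y) in self.arcs if {x, y} == {a, b}]


g = PredictionGraph()
g.add_node(ShapeNode("fuselage", "trapezoid", 30.0, 60.0))
g.add_node(ShapeNode("wing", "trapezoid", 20.0, 45.0))
g.add_node(ShapeNode("end-a", "ellipse", 5.0, 12.0))
g.add_node(ShapeNode("end-b", "ellipse", 5.0, 12.0))
g.add_arc("connected", "fuselage", "wing")   # subparts that touch in space
g.add_arc("angle", "fuselage", "wing")       # angle between the two spines
g.add_arc("exclusive", "end-a", "end-b")     # both cylinder ends never visible together
```

During interpretation, a pair of hypothesized matches would be checked by looking up `arcs_between` for the two prediction nodes and testing each returned relation on the image features.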


Modeling 


Back Constraints. The shape and size of image features are
clearly viewpoint dependent. The predictions are constrained
with upper and lower bounds based on the range of
camera-to-object geometry. In addition to predicting the shapes and
relations, the prediction module also generates equations to
constrain the relative camera-object geometry from image
measurements.  For  example,  from  most  viewpoints  the  circular 
end  of  a  cylinder  will  appear  as  an  ellipse.  Furthermore,  the 
eccentricity  of  the  image  ellipse  constrains  the  orientation  of  the 
cylinder  relative  to  the  camera. 
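
As a worked instance of the ellipse example, the sketch below assumes orthographic projection, under which a circle whose normal makes an angle `tilt` with the line of sight projects to an ellipse with minor/major axis ratio cos(tilt). The function names and the tolerance handling are assumptions for illustration; the text does not give ACRONYM's actual back constraint equations.

```python
import math


def tilt_from_axes(major, minor):
    """Angle (radians) between the line of sight and the circle's normal,
    assuming orthographic projection: minor / major = cos(tilt)."""
    if not (major > 0.0 and 0.0 <= minor <= major):
        raise ValueError("need major > 0 and 0 <= minor <= major")
    return math.acos(minor / major)


def tilt_bounds(major, minor, tol):
    """Interval of tilts consistent with axis lengths measured to +/- tol."""
    if not (0.0 <= tol < major):
        raise ValueError("tolerance must be non-negative and smaller than major")
    ratio_lo = max(0.0, (minor - tol) / (major + tol))
    ratio_hi = min(1.0, (minor + tol) / (major - tol))
    # acos is decreasing, so the largest ratio gives the smallest tilt.
    return math.acos(ratio_hi), math.acos(ratio_lo)


exact = tilt_from_axes(2.0, 1.0)          # acos(0.5), i.e. 60 degrees
lo, hi = tilt_bounds(2.0, 1.0, 0.1)       # interval around 60 degrees
```

The interval form matters because image measurements are inexact: the measured ellipse constrains the cylinder orientation to a range rather than a single value.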

Image  Description 

ACRONYM  describes  an  image  as  a  graph  (the  picture 
graph)  of  trapezoids  and  ellipses.  That  is,  edges  must  be  linked 
into  trapezoids  and  ellipses.  The  prediction  graph  provides 
general  guidelines  for  the  shapes  and  sizes  that  can  be  expected 
from  the  image.  The  description  module  then  looks  for  those 
shapes,  ignoring  the  predicted  relationships  between  them.  The 
result  is  a  picture  graph,  consisting  of  shape  descriptions  and 
their  locations  and  orientations  in  the  image. 

Interpretation  (Matching) 

ACRONYM  interprets  images  by  subgraph  matching 
(subgraph  isomorphism)  between  the  picture-graph  description 
of  the  image  and  the  prediction  graph  expectation.  It  proceeds 
by  combining  local  matches  of  shapes  to  individual  generalized 
cones  (from  the  ribbon  finder),  into  global  matches  for  complete 
objects. The global interpretation must satisfy the requirements
specified  by  the  arcs  of  the  prediction  graph  (i.e.,  the 
relationships  between  subparts).  The  constraints  that  each  local 
match  implies  on  the  three  dimensional  model  must  be  globally 
consistent.  As  hypothesized  local  matches  are  combined  into 
more complete objects, the size of image shapes and
relationships  provide  constraints  on  the  camera  position  and 
orientation  relative  to  the  object.  Inconsistent  constraints 
indicate  an  incorrect  hypothesis. 
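
The consistency test described above can be sketched by reducing each local match to interval constraints on named viewing parameters and intersecting them. This is a simplification made for illustration (ACRONYM's constraints are symbolic); the variable name `tilt` and the numbers are hypothetical.

```python
def intersect(a, b):
    """Intersection of two closed intervals (lo, hi); None if empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def combine(local_matches):
    """Merge per-match constraints of the form {variable: (lo, hi)}.

    Returns the merged constraints, or None as soon as any variable's
    interval becomes empty (an inconsistent, hence incorrect, hypothesis).
    """
    merged = {}
    for constraints in local_matches:
        for var, interval in constraints.items():
            narrowed = intersect(merged.get(var, interval), interval)
            if narrowed is None:
                return None
            merged[var] = narrowed
    return merged


fuselage = {"tilt": (0.2, 0.6)}
wing = {"tilt": (0.4, 0.9)}
stray = {"tilt": (0.7, 1.0)}

consistent = combine([fuselage, wing])           # narrows to (0.4, 0.6)
inconsistent = combine([fuselage, wing, stray])  # empty intersection
```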

Initial Domain (i.e., Aerial View of Aircraft)

The success of the original implementation is largely due to
the fact that most objects exhibit structure which is well
approximated by skeletal description. Consider the human
ability to recognize stick figure drawings. Stick figure drawings
resemble the objects they represent only in skeletal structure:
the relative lengths of each segment and the relationships
between segments.⁵ Aircraft in particular are skeletal and can be
disambiguated by their skeletal structure.

Extensions  to  Acronym 

The  application  of  ACRONYM  to  industrial  parts  represents  a 
significant domain shift. The base ACRONYM system included
sufficient geometric reasoning to demonstrate the power of
feature prediction and image interpretation, but the
implementation was not completely general. In particular, the
features of aircraft from aerial viewpoints produce image
trapezoids. Not surprisingly, a large proportion of its geometric
reasoning rules are devoted to prediction and interpretation of
trapezoids.

In order to recognize the parts of the switch assembly,
modifications to base ACRONYM were required. These
modifications included changes to and extension of the
modeling, prediction and image interpretation (object matching)
modules. In this section we discuss the modifications which
were necessary and the performance of the ITA version of
ACRONYM on the micro switch components.


Holes. Base ACRONYM was capable of reasoning about solid
volumes only. Every item in the switch assembly contains a hole,
and for many objects the hole is an essential characteristic (e.g.,
a washer). Our description of holes follows the approach of base
ACRONYM for solid objects. We describe a hole by a
generalized cylinder with an additional tag indicating that it is
void (negative volume). We restrict the representation to holes
which are surrounded by some solid. This restriction allows us to
model all the parts in the ITA task but avoid arbitrary intersection
of volumes.
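
A minimal sketch of the void tag and the "surrounded by some solid" restriction, assuming the simplest case of a straight hole coaxial with a right circular cylinder. The `Cylinder` class, its fields and the helper names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cylinder:
    radius: float
    length: float
    void: bool = False   # True marks a hole (negative volume)


def hole_is_surrounded(solid, hole):
    """Coaxial case only: the hole must be tagged void, the solid must not
    be, and the hole must be strictly narrower and no longer than the solid."""
    return (hole.void and not solid.void
            and hole.radius < solid.radius
            and hole.length <= solid.length)


def washer(outer_radius, inner_radius, thickness):
    """Model a washer as a solid cylinder plus a coaxial void cylinder."""
    solid = Cylinder(outer_radius, thickness)
    hole = Cylinder(inner_radius, thickness, void=True)
    if not hole_is_surrounded(solid, hole):
        raise ValueError("hole must lie inside the solid")
    return solid, hole


solid, hole = washer(10.0, 4.0, 1.0)
```

Because every hole is checked against its enclosing solid, the model never has to represent an arbitrary intersection of volumes.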

Springs. While many man-made objects are easily described
by a generalized cylinder with straight spine, the micro switch
assembly includes three springs (two cylindrical and one conical
spring). We have implemented a new generalized cylinder spine
type: the helix. A spring is a circular cross section swept along a
helical spine. A helical spine can also be used to describe the
threads of a bolt by using a triangular cross section. Special
primitives describe the ends of a spring. A squared end indicates
that the helix is deformed (squared) on the last revolution of each
end. A plain end is simply the undeformed helical spine. The
model of the conical (tapered) spring uses the taper primitive to
describe the ratio of the small end of the spring to the large end.
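
To make the helical spine concrete, the sketch below samples points along a helix about the z axis, with an optional taper for the conical spring. The parameter names, the sampling density and the choice of a linear taper along the spine are assumptions for illustration; plain (undeformed) ends are assumed.

```python
import math


def helix_spine(radius, pitch, turns, taper=1.0, samples_per_turn=16):
    """Sample points (x, y, z) along a helical spine about the z axis.

    radius: helix radius at the large end
    pitch:  rise along z per full revolution
    turns:  number of revolutions (must be positive)
    taper:  ratio of small-end radius to large-end radius (1.0 = cylindrical)
    """
    if turns <= 0 or samples_per_turn <= 0:
        raise ValueError("turns and samples_per_turn must be positive")
    n = max(1, int(round(turns * samples_per_turn)))
    points = []
    for i in range(n + 1):
        s = i / n                                  # 0 at large end, 1 at small end
        angle = 2.0 * math.pi * turns * s
        r = radius * (1.0 + (taper - 1.0) * s)     # linear taper (assumed)
        points.append((r * math.cos(angle), r * math.sin(angle), pitch * turns * s))
    return points


cylindrical = helix_spine(5.0, 2.0, 3)
conical = helix_spine(5.0, 2.0, 3, taper=0.5)
```

Sweeping a small circular cross section along these points gives the spring; a triangular cross section along the same kind of spine would give a bolt thread.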

Stable States. A further extension to the modeling system is a
representation for the stable states of an object. A stable state is
an orientation in which the object will remain stable on a level
surface.⁶ For example, a convex regular polyhedron is stable on
any of its faces. A right circular cylinder is stable on its ends and
on its side (its stability depends on the ratio of radius to height).
The representation consists of constraints on the orientation and
position of the object in the tray. Some information may be
unspecified, such as the orientation about the right circular
cylinder's axis (i.e., it can roll when lying on its side). By using
symbolic constraints we can not only model the nominal stable
state position (on a plane), but can also represent perturbations
of the stable state explicitly. For example, in its nominal stable
state, the large washer lies flat on a plane. Our representation
for it, however, can also describe its orientation when lying at an
angle (on top of another object).


Figure 2: Stable States of a Washer (panels: Nominal, Leaning)

While  ACRONYM  is  capable  of  reasoning  about  positions  and 
orientations  which  are  completely  unconstrained,  the  stable 
states  allow  ACRONYM  to  produce  tight  bounds  on  the  predicted 
shapes  and  relationships.  Stable  state  constraints  reduce  the 
complexity  of  prediction  and  allow  image  interpretation  to 
proceed more quickly and to achieve higher confidence object
hypotheses. 

The  position  of  the  camera  relative  to  the  world  coordinate 
system  is  well  known.  When  combined  with  the  stable  state 
description  for  an  object,  this  means  that  the  predicted  shape 
and  relationship  parameters  are  constrained  to  exact  values  (i.e., 
lower  and  upper  bounds  are  equal)  and  no  back  constraint 
equations  are  generated.  However,  if  the  object  orientation  is 
described  with  some  error  tolerance  (in  case  the  object  may  lean 
on  another  object),  back  constraints  will  be  useful  in  determining 
the  actual  object  orientation. 
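
The effect of a stable state on prediction bounds can be sketched for the washer. The example assumes a camera looking straight down and orthographic projection, so a circular face of diameter d tilted by an angle t from the plane images as an ellipse with major axis d and minor axis d cos(t). The function name and the 30 degree leaning limit are hypothetical.

```python
import math


def minor_axis_bounds(diameter, tilt_lo, tilt_hi):
    """Bounds on the image minor axis of a circular face of the given
    diameter, for tilts in [tilt_lo, tilt_hi] radians.

    Assumes an overhead camera and orthographic projection.
    """
    if not (0.0 <= tilt_lo <= tilt_hi <= math.pi / 2):
        raise ValueError("tilts must satisfy 0 <= lo <= hi <= pi/2")
    # cos is decreasing on [0, pi/2], so the largest tilt gives the lower bound.
    return diameter * math.cos(tilt_hi), diameter * math.cos(tilt_lo)


nominal = minor_axis_bounds(20.0, 0.0, 0.0)                  # lying flat
leaning = minor_axis_bounds(20.0, 0.0, math.radians(30.0))   # may lean on a neighbor
```

In the nominal state the lower and upper bounds coincide, so there is nothing left for a back constraint to determine. In the leaning state the measured minor axis falls somewhere inside the interval and can be used to recover the actual tilt.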


Prediction 


The  prediction  module  received  the  most  extensive  changes. 
We  improved  ellipse  prediction,  and  added  a  variety  of  new 
relational  predictions  between  shapes.  ACRONYM  finds  image 
shapes  (which  match  shape  predictions)  and  uses  predicted 
relationships  to  prune  the  search  for  a  set  of  shapes  which  is 
consistent with the hypothesized object. Increasing the number
and  strength  of  relationship  predictions  improves  the 
performance  and  accuracy  of  image  interpretation. 

Relationships between the subparts of an object are very
important to ACRONYM's performance. For the ITA parts we
found that the relationships which base ACRONYM could predict
were not sufficient. Stronger relations were needed to constrain
possible interpretations and to link the different shapes
predicted. Because base ACRONYM had no prediction for
holes, no relationship existed between the inner void and outer
solid of a hollow cylinder. (Base ACRONYM assumed everything
was solid, so it never reasoned about the insides of anything.)
Washers would have been predicted as a single ellipse. Clearly
these weak predictions, such as a single ellipse for a washer,
would not lead to high confidence in object interpretation
hypotheses. Thus we needed new relational invariants, or at
least semi-invariants. The extensions that are most important for
micro switch parts follow:

•  reasoning  about  stable  states 

•  reasoning  about  holes 

•  concentric  relation  between  concentric  cylinders 
(e.g., solid and void cylinders describing a washer)

•  connected  relation  between  ellipses  and  trapezoids 
and  between  ellipses  and  ellipses 

•  parallel  relation  between  coils  of  a  spring 

•  enclosed  relation  between  coils  of  a  spring  and  the 
imaginary  bounding  of  the  spring 

Figure  3  demonstrates  prediction  of  an  object  with  holes. 
This  prediction  example  is  for  stable  state  three  of  the  bushing 
(standing on its smaller end). The bushing is composed of five
cylinders: two outer (solid) volumes and three inner (hole)
volumes (see figure 3a).




Stable State Reasoning. Reasoning was added to achieve
better  prediction  about  the  range  of  shapes  that  will  be  visible  in 
an  image.  The  constraints  on  object  location  form  a  context  for 
the  object  model.  ACRONYM  was  modified  to  reason  about  the 
appearance  of  an  object  model  within  multiple  contexts  (i.e., 
multiple  stable  states).  The  improvements  allow  greater 
performance  and  produce  more  accurate  image  shape 
descriptions. 

Holes. We have implemented some useful, although certainly
not complete, reasoning about holes. We assume that a hole is
fully contained in a solid; the shape of the hole and the shape of
the solid are the only shapes to consider. We need not consider
arbitrary intersections. (Note that generalized cylinders are not
closed under arbitrary intersection.) While this may seem
restrictive, the solid-hole relationships in most industrial parts are
quite simple: collinear, parallel or perpendicular axes.

Much of the reasoning about which surfaces of a solid may
be visible does not apply to holes. Surfaces of a solid
generalized cylinder may occlude surfaces which are further
from the viewpoint. The touching surfaces of connected solids
are also occluded. Holes, however, are invisible. A hole cannot
occlude anything: neither its own surfaces which are further
from the camera nor any object connected to the hole. The
swept surfaces of a solid object are often quite important;
indeed, for aerial views of aircraft, only swept surfaces are
visible. The swept surface is almost never visible for a straight
hole. While it is possible to imagine an object where the swept
side of a hole is visible (e.g., half of a hole), we do not consider
such rare objects.

Concentric Shapes. The concentric relationship is important
for many of the objects which contain holes, especially for
washer-like and ring-like objects. If two generalized cylinders are
concentric in three dimensions, they are concentric in any
projection. That is, concentricity is invariant and therefore a very
strong  pruner  of  hypothesized  interpretations.  For  the  ITA 
project,  the  concentric  relationship  is  predicted  for  10  of  15 
objects  (17  of  the  32  stable  states). 
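
A minimal sketch of how a concentric arc might be tested on image ellipses, assuming each ellipse is described by its center and axis lengths in pixels and that "concentric" means the centers agree to within a tolerance. The `Ellipse` class, the tolerance and the coordinates are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Ellipse:
    cx: float      # center x, pixels
    cy: float      # center y, pixels
    major: float   # full length of the major axis
    minor: float   # full length of the minor axis


def concentric(a, b, tol):
    """True if the two ellipse centers lie within tol pixels of each other."""
    return math.hypot(a.cx - b.cx, a.cy - b.cy) <= tol


def concentric_partners(outer, candidates, tol):
    """Keep only the candidates that could be the hole of the given outer ellipse."""
    return [c for c in candidates
            if concentric(outer, c, tol) and c.major < outer.major]


outer = Ellipse(100.0, 80.0, 40.0, 30.0)
inner = Ellipse(101.0, 80.0, 16.0, 12.0)
stray = Ellipse(160.0, 80.0, 16.0, 12.0)
kept = concentric_partners(outer, [inner, stray], tol=2.0)
```

The pruning power comes from the fact that an unrelated ellipse of the right size is unlikely to share a center with the outer ellipse by chance.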

A  weaker  form  of  concentricity  was  also  implemented.  If  two 
cylinders  (of  unequal  length)  share  the  same  axis  and  the  shorter 
one  is  inside  the  other,  then  whenever  you  see  both  holes,  they 
are  concentric,  or  nearly  concentric.  This  applies,  in  particular, 
to the case of structures inside a cylinder, such as the bushing
viewed  from  either  end. 

Connected Shapes. Reasoning about the appearance of
holes is often different than for solids. Both ends of a solid are
never simultaneously visible (the swept side occludes one end).
For a hole, however, both ends are visible whenever the
viewpoint  is  near  the  line  of  the  hole  axis.  Furthermore,  if  both 
ends  are  visible  then  they  must  be  touching  in  the  image 
(remember  the  hole  is  surrounded  by  some  solid).  The  ends  of  a 
hole  are  connected. 

While  some  code  to  predict  ellipses  existed  in  the  original 
ACRONYM  system,  ellipse  prediction  and  matching  were  not 
fully implemented; few relationships involving ellipses were
implemented.  We  added  a  stronger  form  of  connectedness  to 
ACRONYM.  Instead  of  connectedness  between  only  trapezoid 
and  trapezoid,  we  now  predict  connectedness  between  ellipses 
and  trapezoids,  and  between  ellipses  and  ellipses.  This 
relationship  is  important,  for  example,  for  viewpoints  from  which 
one  sees  the  side  and  one  end  of  a  cylinder. 

Parallel Trapezoids. The relation between parallel trapezoids is
important for the prediction of springs. If two generalized
cylinders or two parallel coils of a spring are parallel in three
dimensions, their images are also parallel under any projection.
This relation is another invariant we have made use of. When the
spring is lying flat on its side, in one of its stable states, the parallel
coils of the spring appear as a set of parallel trapezoids in the
image (figure 4). These trapezoids are not connected in the
image. Since the spring is modelled as a single part and not as a
collection of subparts, the distance, angle, and contained
relations are no longer useful. A new parallel relation is thus
predicted between each pair of the trapezoids. This new relation
also helps to prune off other undesirable random trapezoids in
the image.
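
The pairwise parallel test can be sketched as a comparison of undirected axis orientations. The sketch assumes each image trapezoid has already been reduced to the two endpoints of its axis; the function names, coordinates and tolerance are hypothetical.

```python
import math
from itertools import combinations


def axis_angle(p, q):
    """Orientation of the line through points p and q, in [0, pi)."""
    return math.atan2(q[1] - p[1], q[0] - p[0]) % math.pi


def parallel(angle_a, angle_b, tol):
    """True if two undirected orientations differ by at most tol radians."""
    diff = abs(angle_a - angle_b) % math.pi
    return min(diff, math.pi - diff) <= tol


def mutually_parallel(angles, tol):
    """True if the parallel relation holds between every pair of axes."""
    return all(parallel(a, b, tol) for a, b in combinations(angles, 2))


coils = [
    axis_angle((0, 0), (2, 10)),
    axis_angle((5, 0), (7, 10)),
    axis_angle((12, 10), (10, 0)),   # same line direction, endpoints reversed
]
stray = axis_angle((0, 0), (10, 1))
```

Orientations are taken modulo pi so that an axis traced in the opposite direction still counts as parallel.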

Enclosed Trapezoids. To increase our confidence in
prediction of springs, we also predict an imaginary bounding
hollow cylinder enclosing a spring (figure 4). The coils of the
spring are enclosed in this imaginary cylinder. Thus, when the
spring lies on its side, the projection of this imaginary cylinder
will appear as a trapezoid enclosing all the parallel trapezoids
projected from the parallel coils of the spring. When the spring is
standing on its ends, the prediction is exactly the same as that
for a hollow cylinder of the same size.
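
One way to test the enclosed relation is to check that every vertex of a coil trapezoid lies inside the bounding trapezoid. The sketch below assumes trapezoids are given as convex polygons with vertices in a consistent order; the function names and coordinates are hypothetical.

```python
def inside_convex(point, polygon):
    """True if point lies inside or on a convex polygon given as a list of
    (x, y) vertices in consistent (clockwise or counterclockwise) order."""
    sign = 0
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i]
        bx, by = polygon[(i + 1) % n]
        # The sign of the cross product tells which side of edge a->b the point is on.
        cross = (bx - ax) * (point[1] - ay) - (by - ay) * (point[0] - ax)
        if cross != 0:
            s = 1 if cross > 0 else -1
            if sign == 0:
                sign = s
            elif s != sign:
                return False
    return True


def enclosed(inner, outer):
    """True if every vertex of the inner trapezoid lies within the outer one."""
    return all(inside_convex(v, outer) for v in inner)


bounding = [(0, 0), (100, 0), (100, 20), (0, 20)]
coil = [(10, 2), (14, 2), (16, 18), (12, 18)]
stray = [(90, 10), (110, 10), (110, 15), (90, 15)]
```

Since both shapes are convex, checking the vertices is sufficient: if all vertices of the inner polygon are inside the outer one, so is the whole polygon.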


Figure 4: Spring Features
a) Spring in Space (labels: bounding cylinder, parallel coils)
b) Spring in Image (labels: bounding trapezoid, parallel trapezoids)

Performance  on  the  ITA  parts 

Our changes have been very successful. Of the fifteen parts,
ten can be recognized in any stable state. We can automatically
generate shape predictions for all of the remaining five parts, but
the currently implemented relationships are not sufficient to link
the shapes (i.e., we cannot yet locate these two parts in an
image).

The ITA version of ACRONYM has been tested on data from
real  pictures.  We  have  fully  automated  the  process  from  picture 
to  interpretation;  the  linking  of  edges  into  image  shapes  is 
accomplished  automatically  (the  original  image  description 
module  did  not  find  ellipses). 

Although analyzing an object model and predicting the image
shapes which it will produce is expensive, it need only be done
once. That is, predictions are done "offline" and stored for later
use by the image description and image interpretation modules.
The complexity of the object increases the cost of prediction.
The large washer, a very simple object, requires 14 CPU seconds
on a VAX 11/780, while prediction for the switch element requires
about 6 CPU minutes for both of its stable states. A large
performance improvement can be realized by recoding the
algorithms; the current implementation uses a MACLISP
compatibility package which runs on top of FRANZ LISP.

Figure 5 illustrates ACRONYM's performance at image
interpretation. Although the parts will be located individually for
the actual assembly, for this example we instructed ACRONYM
to search for the bushing and for the two switch elements
simultaneously.  Only  the  strongest  hypothesis  for  each  object  is 
shown  in  the  interpretation.  Notice  also  that  the  image  data  is 
imperfect  and  that  one  terminal  pin  is  missing.  In  general, 
deformed  or  missing  subparts  of  an  object  affect  only  the 
strength  of  the  interpretation  hypothesis  and  will  not  cause  a 
hypothesis  to  be  discarded. 

The asymmetry of the switch element is an important
consideration for the assembly operation. ITA ACRONYM has
determined that both switch elements lie on their "left" side.
Two incorrect hypotheses were also generated, each labeling a
switch element with the wrong stable state (i.e., lying on its
"right" side). The incorrect hypotheses contain fewer predicted
shapes (they are weaker) because each hypothesis omits the
middle pin (the middle pin position is inconsistent with the
right-side hypothesis). Generation of the switch element hypotheses
required 85 CPU seconds in this example (again, improvements
can be achieved by recoding).

When searching for the bushing in any of its three stable
states, many of the concentric shapes in figure 5c caused
ACRONYM to generate bushing hypotheses. The correct
interpretation is the strongest (figure 5d): all four image ellipses
are consistent with the stable state 3 hypothesis. The same
ellipses also form the second strongest interpretation; three of
the ellipses are consistent with the bushing standing on the other
end. That is, the image shapes from the bushing generated the
two strongest interpretations.

Several weak hypotheses were generated from other sets of
image ellipses. For example, ACRONYM generated a weak
hypothesis that the large washer (the concentric circles just
below the bushing) is a bushing standing on end. The outer
diameter of the washer is identical to that of the bushing and the
inner diameter is within error tolerance. The correct hypothesis for the
bushing is stronger because more of the predicted features were
located in the image (i.e., the internal structure of the bushing
disambiguates the possible interpretations).

Sometimes ACRONYM is unable to generate a single strong
hypothesis for an object. For example, if the viewpoint in the
preceding example had been oblique to the point that no
internal structure was visible, the incorrect hypothesis (the
washer) would have been equally strong. Better object
identification can sometimes be achieved by tightening the error
bounds on finding an ellipse (e.g., to eliminate the inner diameter
of the washer). However, a few objects produce ellipses of
identical dimensions and cannot be uniquely located by finding
image shapes. When no one object hypothesis is sufficiently
strong, the sparse range sensor (light stripe) will be used to
obtain a depth profile to distinguish between the hypotheses.
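
The ranking of hypotheses in the bushing example, and the tie that would send the system to the range sensor, can be sketched as follows. Strength is taken here to be simply the number of distinct image shapes a hypothesis accounts for, and "not sufficiently strong" is reduced to a tie between the top two; both are simplifying assumptions, and the labels and shape identifiers are hypothetical.

```python
def rank_hypotheses(matches):
    """matches maps a hypothesis label to the image shapes it accounts for.

    Returns (label, strength) pairs, strongest first, where strength is
    the number of distinct matched shapes.
    """
    scored = [(label, len(set(shapes))) for label, shapes in matches.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def needs_range_sensor(ranked):
    """True when no single hypothesis is strictly strongest."""
    return len(ranked) > 1 and ranked[0][1] == ranked[1][1]


ranked = rank_hypotheses({
    "washer read as bushing": ["w1", "w2"],
    "bushing, other end": ["e1", "e2", "e3"],
    "bushing, stable state 3": ["e1", "e2", "e3", "e4"],
})

# Oblique view: no internal structure visible, so both readings match two ellipses.
oblique = rank_hypotheses({
    "bushing": ["e1", "e2"],
    "washer read as bushing": ["w1", "w2"],
})
```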

Figure 6 illustrates the situation where ITA ACRONYM fails to
detect the desired part. The bottom switch element shapes
conform to the poor edge data for this image. The shapes for the
top switch were artificially degraded for illustration. Because
image data is inexact, image interpretation allows for errors (in
both the base and the ITA versions of ACRONYM). In this
example, the image trapezoid for the body of the lower switch
element is within error bounds, but the body of the upper switch
is not (it does not match the prediction). Only three of the six
terminal pins match the shape predictions (the upper two pins of
the lower switch element and the leftmost pin of the upper switch
element). Image errors also affect the relationships between
shapes. Even if the upper switch body had matched the shape
prediction, the angle between the body and the leftmost terminal
pin is out of the allowed error tolerance (the hypothesis would be
discarded). For any composite object (multiple subparts), image
data must match at least two subparts. Three shapes for the
lower switch element are within the error tolerance and their
relationships are mutually consistent; that is, the image data is
sufficient to form and retain the displayed hypothesis.

Figure 5: Search for Bushing and for Switch Element (surviving panel label: d) Interpretation)

Figure 6: Search for Switch Element with Bad Data (surviving panel labels: b) Edges, d) Interpretation)

Future 


Planar  Reasoning 

ACRONYM currently reasons about 3-D shapes and 3-D
relationships (such as collinearity) which are invariant when
projected into two dimensions. This type of reasoning provides
useful constraints on viewpoint but is inadequate for detailed
object recognition or inspection. The housing (figure 7) contains
a plane with 6 holes in a pattern. While it may be possible to
identify 3-D relationships for this pattern of holes, this is
essentially a collection of 2-D relationships and reasoning in two
dimensions is more appropriate.

Notice  that  the  hole  pattern  is  not  symmetric  about  the 
vertical  axis:  the  orientation  of  the  object  is  determined  by 
recognizing  the  orientation  of  the  hole  pattern.  We  will  describe 
the planar cluster of holes by specifying a coordinate frame for
the plane and describing the location of each hole in that
coordinate system. The predictions will relate the size of the circular
plane (containing the 6 holes) to the size of each hole and will
specify that the major axes of the image ellipses must be parallel.

With the addition of this 2-D reasoning, ACRONYM will
predict a hierarchy of features: "coarse" 3-D features for
generating object hypotheses, and "fine" 2-D features for
detailed reasoning. When interpreting an image, the
hypothesize-and-test strategy will first match coarse image shapes
to 3-D predictions, then fine image features to the 2-D predictions.
Reprocessing of the raw image may be required to obtain the
desired fine image data (over small portions of the image).

Figure 7: Housing (panels: Perpendicular View, Oblique View)

Reflectivity 

Specular reflections are particularly important for the ITA
parts (as for any shiny object). ACRONYM currently contains a
very simple light source model (a single point) which was used
for crude shadow prediction in a previous project. Ideally, we
want to reason about multiple and more complex light sources to
predict characteristics of specular reflections.


For example, a shiny cylinder usually exhibits bars or stripes.
The specular surface approximates a mirror. The cylindrical
shape of the mirror causes light from a large area of space to be
reflected onto a small area of the viewing plane. Under general
lighting, many large light sources (white walls, fluorescent lights,
etc.) produce obvious light stripes. Each stripe is widened
(diffused) by the imperfections of the mirror, i.e., the shiny
surface is not a perfect mirror.

Figure 8: Light Stripes on a Metallic Cylinder (labels: large light source, viewpoint)

Reasoning  about  light  sources  can  also  predict  shadows.  In 
a  number  of  images,  a  cylindrical  object  reflects  its  own  shadow. 
This  shadow  is  important  as  it  often  masks  the  actual  boundary 
of the cylinder. We wish to account for this effect when
calculating  the  upper  and  lower  bounds  on  the  apparent 
diameter  of  the  cylinder  (when  lying  on  its  side). 

Useful predictions can be achieved without complex
light-source models. Under the general lighting assumption, a shiny
cylinder will always exhibit stripes. Stripes will be visible anytime
the portion of the environment which is reflected by the cylinder
is not uniform. Predicting the number, size, or location of the
stripes requires sophisticated light models.

Successor 

Part of the value of this project is that we identify important
issues for the next-generation model-based vision system. Work
is underway on the "successor" to ACRONYM.

ACRONYM employs limited image shape descriptions: all
shapes were approximated by trapezoids or ellipses. These
shapes are sufficient for the coarse reasoning necessary to
constrain the viewpoint but obviously cannot describe all the
possible image projections in our set of switch parts. Much
current  research  in  computer  vision  is  devoted  to  the  description 
of  shapes  in  an  image. 

Many of the parts for the switch assembly contain fine details
such as rounded corners and small slots which are important for
successful assembly. While it is possible to model these fine
details in ACRONYM, the rules which embody the reasoning
about them would be complex, awkward and numerous. In the
future we hope to use a uniform representation to avoid the case
analysis that becomes unmanageable in ACRONYM. Case
analysis is acceptable for improved performance in common
cases but must be supported by general reasoning.

ACRONYM uses the structure of objects (the relative length
of each subpart, the angle between two subparts, etc.) for
feature prediction and image interpretation. It is quite competent
with objects having skeletal structure, such as objects
recognizable from stick figure drawings. However, several of the
switch parts have similar skeletal structure (e.g., all the washers
have the same structure). Additional relationships will be
required for classes of objects which lack definitive skeletal
structure.

ACRONYM uses graph matching (subgraph isomorphism) to
find image features (the interpretation graph) represented in the
prediction graph. Relationships connect the shapes for both
graphs. In real images, some weak relationships may be lost due
to noise or artifacts from edge detection. Subgraph isomorphism
makes no allowance for partial evidence: a node is either
completely consistent with all the predicted relationships or it is
inconsistent. Intelligent interpretation of image data requires
matching groups of features and relationships on the basis of
partial evidence.


Conclusion 

The  original  ACRONYM  vision  system  demonstrated  the 
power  of  geometric  reasoning  for  image  feature  prediction  and 
image  interpretation.  It  was  very  successful  in  recognizing 
objects  from  skeletal  structure.  The  modifications  for  the 
Intelligent Task Automation project have added a new capability
to  reason  about  and  recognize  objects  with  holes.  A 
representation  for  springs  was  developed  and  new  relations  for 
spring  feature  prediction  were  implemented.  Results  on 
successful  and  unsuccessful  image  interpretations  were 
presented. 
ABSTRACT

We  discuss  the  design  of  a  large  scale  Content 
Addressable  Array  Parallel  Processor  (CAAPP)  for 
low,  medium  and  high  level  vision  processing.  This 
new  architecture  combines  associative  processing 
with  global  broadcast  and  response  to  and  from  an 
array  of  cells,  and  array  processing  via  local 
cellular  square  neighborhood  computation.  The 
capabilities  of  the  CAAPP  allow  us  to  close  the 
feedback  loop  between  high  level  processing  and  low 
level  processing  by  supporting  communication 
between  different  representations  of  an  image.  The 
CAAPP  would  provide  a  means  of  mapping  the  signal 
level  (iconic)  pixel-based  representation  of  an 
image  into  a  symbolic  intermediate  level 
representation  suitable  for  high  level  vision 
processing. 

1.0 INTRODUCTION

The  Content  Addressable  Array  Parallel 
Processor  (CAAPP)  is  a  new  architecture  specially 
designed for machine vision processing [WEE82, 84a,
84b]. The CAAPP is a "processor per pixel"
parallel image architecture which represents the
synthesis of both content addressable processors
(such as STARAN or ASPRO [BAT82]) and mesh
connected parallel array processors (such as ILLIAC
IV [BAR68] or CLIP-4 [DUF78]). The full system
will augment the CAAPP array via its controller
with a host processor such as a VAX-11/780 or a
LISP  machine.  The  resulting  architecture  can  be 
used  for  both  associative  and  array  type  operations 
encountered  in  image  processing  and  computer  vision 
tasks  to  produce  simple  solutions  that  are 
difficult  for  parallel  machines  which  provide  only 
one  of  these  capabilities  (Figure  1). 

The CAAPP will be capable of performing all low
level image processing tasks, but more importantly
it will provide a mechanism for transforming low
level image data into higher level symbolic data
directly, without mediation (or serial
processing) by symbolic "host" processors. Thus,
it  allows  for  a  new  style  of  high  level  algorithms 
where  processing  decisions  can  be  based  on  direct 
global  feedback  information  from  the  processing 
elements.  We  have  closed  the  feedback  loop  between 
low  level  and  high  level  processing  by  providing  a 
control  interface.  The  key  to  this  capability  is 
the  provision  of  fast  global  feedback  operations 
from  the  array  to  the  controller  (Figure  2). 


In  the  next  sections  we  discuss  our  view  of  the 
goals  of  general  machine  vision  in  terms  of  image 
interpretation.  We  continue  with  a  discussion  of 
the  characteristics  of  our  design  of  the  CAAPP  as  a 
machine  for  vision  processing.  We  give  a 
description  of  the  CAAPP  and  show  the  utility  of 
such  a  design  by  discussing  some  of  the  algorithms 
we  have  examined  during  our  design  process.  Then 
we  illustrate  the  process  of  mapping  a  low  level 
iconic  representation  of  an  image  into  a  symbolic 
representation  suitable  for  high  level  processing. 

2.0  THE  MACHINE  VISION  PROBLEM 

The  processing  requirements  needed  to  solve  the 
machine  vision  problem  are  not  well  understood. 
The  difficulty  is  that  general  machine  vision  is 
far  from  being  solved,  and  is  currently  a  rapidly 
evolving  area  of  research.  At  this  point  in  time, 
no  one  can  give  a  detailed  algorithmic 
specification  for  a  general  vision  interpretation 
system.  However,  it  is  possible  to  give  a  list  of 
features  that  must  be  present  in  any  machine  that 
is  to  be  used  to  significantly  advance  the  vision 
problem.  We  believe  that  if  such  machines  are 
built,  they  will  greatly  facilitate  research  and 
clarify  many  issues  in  machine  vision  development. 


By  machine  vision,  or  image  understanding,  we 
mean much more than image processing, which
usually refers to the enhancement and
classification  of  images.  The  goal  of  machine 
vision  is  the  automatic  transformation  of  an  image 
to  a  symbolic  form  that  represents  a  description 
and  an  understanding  of  the  content  of  the  image. 
In  general,  the  machine  vision  problem  subsumes  the 
tasks  performed  in  normal  image  processing. 

The  image  understanding  process  can  be  thought 
of  as  an  iconic  to  symbolic  (or  signal  to  symbol) 
transformation.  The  input  image  data  is 
essentially  an  array  of  signal  data  and  forms  an 
iconic  representation  of  the  real  world.  To 
perform  image  interpretation  the  machine  must 
transform  this  data  into  a  symbolic  form.  The 
transformation  is  from  low  level  information  (e.g., 
the pixel at coordinates [112,47] has a blue
intensity value of 17) to symbolic representations
of objects in the scene, in terms of predefined
knowledge about objects in the world (e.g., region
75 in the image is an instance of the object class
HOUSE-DOOR). This task involves monocular static
image  interpretation  as  well  as  integrating 
information  from  multiple  sensory  sources  including 
stereo  input,  motion  sequences,  and  laser  ranging 
information. 

From  our  perspective,  the  machine  vision 
problem  will  be  described  as  involving  three  levels 
of  processing.  These  are  referred  to  as  the  low, 
intermediate and high levels (Figure 3). The low
level consists mainly of operations on pixels and
local neighborhoods of pixels. This may involve
segmentation algorithms to partition pixels into
regions of similar color and texture properties,
and to extract lines via intensity and color
discontinuities  at  local  edges.  The  result  of  what 
we  call  low  level  processing  is  a  transformed  image 
with  labeled  regions  and  line  segments.  However, 
we  are  assuming  that  no  operations  relating 
different  image  events  have  been  performed  nor  have 
there  been  any  inferences  on  the  object  identity  of 
these  events. 

The  intermediate  level  of  representation 
provides  an  interface  between  the  low  and  high 
levels  of  representation,  that  is,  between 
pixel-based  representation  and  symbolic  elements 
representing  visual  knowledge  stored  in  a  database. 
In the UMASS VISIONS system [HAN78a,b, HAN83], which
is  the  environment  in  which  most  of  this  research 
was  conducted,  the  intermediate  level  consists  of  a 
symbolic  description  of  the  two  dimensional  image 
in  terms  of  regions  and  line  segments  (that  are 
still  in  registration  with  the  raw  image  data)  as 
well  as  their  associated  attributes  which  can  be 
used  in  the  interpretation  process.  In  some 
systems  this  level  would  consist  of  representations 
of  surfaces,  or  more  generally,  "intrinsic" 
features of the physical environment [BAR78, MAR82].

Intermediate  processing  includes  several  kinds 
of  activities.  First  is  the  set  of  bottom-up  tasks 
which  are  needed  to  complete  the  intermediate  level 
of  representation.  This  includes  the  extraction  of 
the  features  for  regions,  lines,  and  vertices  as 
well  as  the  relations  between  these  entities.  The 
results  of  this  processing  are  representations  of 
image  entities: 

* Regions have information about intensity,
color, texture, location, size, major axis
orientation, compactness, labels of
bordering segments and adjacent regions,
etc.

* Line segments have information about
location, orientation, length, width,
contrast, labels of adjacent regions, etc.

* Vertices have information about location,
what line segments they connect, their
curvature, etc.

[Figure 3: Communications and Control Across Multiple Levels of Representation.
High Level: schema, symbolic descriptions of objects, control strategies.
Between the high and intermediate levels: rule-based object matching and inference; object hypotheses; grouping, splitting and adding regions, lines and surfaces.
Intermediate Level: symbolic description of regions, lines, surfaces.
Between the intermediate and low levels: segmentation and feature extraction; goal-oriented resegmentation (additional features, finer resolution).
Low Level: pixels; arrays of intensity, RGB, depth (static monocular, stereo, motion).]


These  representations  are  stored  along  with  the 
normal "pixel" information in the processing
elements  which  hold  the  respective  image  objects. 
Note  that  the  relationships  between  the  objects  in 
the  image  and  objects  in  the  world  are  not  yet 
elucidated. 

The second group of intermediate processing
activities involves grouping, splitting, and
labeling processes, in either data-directed or
knowledge-directed modes (i.e., bottom-up or
top-down) to form intermediate events which more
naturally match stored object descriptions. Some
operations in this class are:

* Labeling points of high curvature on the
perimeter of a region.

* Merging co-linear line segments based on
the properties of their adjacent regions.

* Merging adjacent regions based on their
relationship to shared line segments.


The high level processing controls the
intermediate level of processing where the symbolic
two-dimensional representations of the intermediate
level must be related to object descriptions stored
in a knowledge base. The object descriptions
represent information about the three-dimensional
world in a representation that might be used to
form matches. Thus, objects will be represented in
terms of significant region, line, and surface
features that are expected, spatial relations to
other objects which might be identified, and a set
of control strategies for partial matches, for
merging fragmented regions (possibly due to
texture), for filling in missing information
(possibly due to occlusion), and finally for
inferring the presence of related objects. The
result of high level processing is a symbolic
representation of the content of a specific image
in terms of the general stored knowledge of the
object classes and the physical environment.

Information flow between representations is in
both directions. In the upward direction, the
communication consists of segmentation results from
multiple algorithms, and possibly from multiple
sensory sources. It also involves the computation
of a set of attributes of each extracted image
event to be stored in a symbolic representation.
The summary information and statistics allow
processes at the higher levels to evaluate the
success of lower level operations. It is also the
mechanism for the passing of actual symbols. In
the downward direction the communication consists
of commands for selecting subsets of the image, for
specifying further processing in particular
portions of the image, and requests for additional
information in terms of the intermediate
representation.


From  the  above  description  of  the  machine 
vision  problem,  a  set  of  three  general  requirements 
for  a  machine  vision  architecture  can  be  deduced. 
The  first  is  that  the  machine  must  be  able  to  load 
(and  possibly  dump)  a  complete  image  in  well  less 
than  a  33  millisecond  frame  time  (or  in  parallel 
with  the  actual  processing  of  a  previous  frame). 
Loading  a  512  x  512  RGB  color  image  in  under  one 
frame  time  represents  a  rather  high  data  transfer 
rate. 
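
As a rough check on the size of this transfer rate, the sketch below computes the bandwidth needed to load one frame. The figure of 8 bits per color channel is an assumption made for illustration (the text does not give the pixel depth here), and `required_bandwidth` is a hypothetical helper name.

```python
def required_bandwidth(width, height, bits_per_pixel, frame_time_s):
    """Return the sustained transfer rate, in bits per second, needed
    to load one complete image within a single frame time."""
    total_bits = width * height * bits_per_pixel
    return total_bits / frame_time_s

# 512 x 512 RGB image, ASSUMING 8 bits per channel (24 bits per pixel),
# loaded within one 33 millisecond frame time.
rate = required_bandwidth(512, 512, 24, 0.033)
print(f"{rate / 1e6:.1f} Mbit/s")
```

Under these assumptions the load alone requires roughly 190 megabits per second, before any processing is done.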


Second,  since  a  great  number  of  low  level 
operations  will  be  needed  to  support  processing  at 
the higher levels, the speed requirement indicates
the  necessity  of  the  pixel  per  element  class  of 
mesh  connected  (local  neighborhood)  cellular  array 
processors.  It  is  generally  recognized  that  these 
provide  the  greatest  speed  in  performing  low  level 
image  operations. 

Most important of the architectural
requirements, however, is that a general vision
machine should provide mechanisms for communicating
information  and  control  both  up  and  down  through 
the  three  levels  of  representation.  The  control 
program  must  be  able  to  determine  the  necessary 
summary  information  quickly,  so  that  it  can  try  a 
variety  of  processing  approaches  to  produce  the 
best  results.  This  type  of  communication  is 
necessary  to  permit  the  autonomous  transformation 
of  an  image  to  a  set  of  meaningful  symbols.  For 
this  reason,  the  mechanisms  that  provide  the 
summary  information  must  be  applicable  to  both 
pixel  and  symbol  data. 

To summarize, a key issue in achieving an
effective architecture is the ability to maintain
the low and intermediate representations (pixels
and symbolic region, line and surface
representations) simultaneously in the same machine.
The  necessity  of  dumping  an  image  out,  for 
evaluation  by  a  sequential  program,  must  be  avoided 
at  all  cost.  It  is  too  time  consuming  to  transfer 
the  volume  of  information  contained  in  an  image. 
Even  if  it  took  no  time  to  dump  the  information, 
the  time  required  for  serial  evaluation  would  still 
be  too  great.  Dumping  an  image  for  outside 
evaluation  defeats  the  entire  purpose  of  having  a 
special  parallel  processor  for  computer  vision. 
Instead,  the  vision  machine  must  be  able  to  provide 
enough  feedback  to  the  controlling  processor  to 
allow  all  of  the  operations  to  take  place  within 
the  vision  machine  itself. 

3.0 AN ARCHITECTURE FOR MACHINE VISION

The  CAAPP  has  been  designed  to  support  vision 
processing  research.  It  is  also  sufficiently 
general that new approaches to the various aspects
of vision can be easily implemented on it. It is
quite simple to build special purpose machines that
implement particular image processing algorithms
with  great  speed.  However,  as  mentioned  above, 
machine  vision  research  is  a  dynamic,  rapidly 
changing  area.  New  algorithms  are  constantly  under 
development  and  experimentation.  A  vision  machine 
must  therefore  be  sufficiently  fast  and  general  to 
allow  complex  experimentation  up  to  the 
interpretation level.

The  basic  architectural  issues  to  be  addressed 
for  vision  stem  from  the  requirements  of  the 
problem:

* The ability to process both pixel and
symbol data.

* A fast processing rate.

* The ability to select particular subsets
of the pixels for special processing.

* Feedback mechanisms that allow focussing
of attention and data-directed processing,
without having to dump the image for
external evaluation.

* The ability to transform an image into a
set of meaningful symbols that describe
it.

The  solution  that  we  have  developed  is  a 
machine  that  is  a  fusion  of  mesh  connected  cellular 
array  processors  with  associative  or  content 
addressable  parallel  processing  capability. 
Previous  research  has  shown  that  a  mesh  connected 
cellular  array  is  a  structure  that  is  extremely 
well  suited  to  performing  basic  image  processing 
tasks.  With  one  processing  element  per  pixel,  such 
a  machine  can  perform  many  of  the  basic  image 
processing  operations,  including  both  the  pixel  and 
local  neighborhood  classes  of  operations,  very 
quickly.  The  problem  with  the  cellular  arrays  that 
have  been  proposed  is  that  they  generally  do  not 
provide  for  selective  processing  of  pixel  subsets 
(such  as  collections  of  regions  or  line  segments), 
nor  do  they  supply  feedback  to  the  controller.  In 
other  words,  they  do  not  provide  the  necessary 
bidirectional  communication  between  symbolic 
processing  and  pixel  processing.  An  image  is 
simply  loaded,  some  operations  are  applied  to  it, 
and  then  the  image  is  returned  for  external 
sequential  processing  or  human  presentation. 

Research  on  content  addressable  parallel 
processors  (CAPPs)  has  always  emphasized  selecting 
and  processing  arbitrary  subsets  of  the  data 
elements,  providing  feedback  to  the  controller  and 
doing  whatever  is  necessary  to  keep  from  having  to 
move  data  in  and  out  of  the  processor.  This  is 
because  the  time  required  for  loading  the  data, 
which  is  roughly  equivalent  to  the  time  to  serially 
process  the  data  with  one  operation,  must  be 
included  in  the  total  processing  time.  In  order  to 
claim  any  significant  speed  increase  over  a  serial 
processor,  a  CAPP  must  be  able  to  average  the  data 
load  time  with  a  large  number  of  parallel 
operations.  One  way  of  achieving  this  is  to  reduce 
the  number  of  times  that  the  data  must  be 
transferred  in  and  out,  by  eliminating  the  need  to 
externally  evaluate  the  results  of  processing. 
This  can  be  done  by  providing  global  summary 
mechanisms  that  feed  back  to  the  controlling 
processor,  thereby  allowing  it  to  perform  the 
evaluation  of  the  processing  without  removing  the 
data  from  the  processor. 
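
The amortization argument can be made concrete with a small timing model. This is only a sketch under stated assumptions: the load time is taken to equal one serial pass over the data, as the paragraph above suggests, and every numeric value in the example is hypothetical rather than a measured CAAPP figure.

```python
def effective_speedup(n_elements, n_operations, n_loads,
                      t_serial_elem, t_parallel_op):
    """Speedup of a parallel processor over a serial one when data load
    time is charged to the parallel machine.

    Assumption: loading the data once costs the same as one serial pass
    over all elements (n_elements * t_serial_elem)."""
    serial_time = n_operations * n_elements * t_serial_elem
    load_time = n_loads * n_elements * t_serial_elem
    parallel_time = load_time + n_operations * t_parallel_op
    return serial_time / parallel_time

# Hypothetical timings: 1 microsecond per element serially,
# 10 microseconds per whole-array parallel operation.
one_op = effective_speedup(16384, 1, 1, 1e-6, 1e-5)
many_ops = effective_speedup(16384, 100, 1, 1e-6, 1e-5)
reload_each_time = effective_speedup(16384, 100, 100, 1e-6, 1e-5)
print(one_op, many_ops, reload_each_time)
```

In this model a single operation gains nothing, one hundred operations on resident data gain almost two orders of magnitude, and dumping and reloading after every operation removes the gain again, which is the motivation for keeping the evaluation inside the array.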

3.1 DESCRIPTION OF THE CAAPP

The  CAAPP  would  consist  of  sixteen  thousand 
processing  elements  arranged  as  a  128  x  128  square 
array.  This  design  is  intended  to  be  expandable  up 
to  at  least  a  quarter  of  a  million  processing 
elements  in  a  512  x  512  array.  The  initial  16K 
processor  parallel  machine  will  have  an  effective 
operating  speed  several  hundred  to  a  thousand  times 
that  of  the  fastest  sequential  processor  available 
today.  The  CAAPP  would  be  connected  via  its  own 
controller  to  a  VAX-11/780,  LISP  machine,  or  some 
other  general  purpose  computing  machine  which  would 
provide  both  the  algorithm  development  environment 
and  the  operating  environment  for  the  system. 


The  machine  would  be  constructed  as  a  square 
grid  of  128  x  128  processing  elements  (or  cells  or 
P.E.s).  We  intend  to  extend  this  prototype  to  512 
x  512,  which  corresponds  to  the  usual  number  of 
pixels  in  a  digitized  image.  Each  cell  contains 
128  bits  of  storage,  five  register  bits,  and  a 
one-bit  ALU  for  bit  serial  arithmetic  and  logic 
functions.  Information  in  each  cell  can  be  moved 
North,  South,  East,  or  West  on  the  array  so  that 
neighboring cells can communicate with each other.
The  128  x  128  memory  array  is  controlled  by  a 
microprogrammed  controller  capable  of  issuing  an 
array  command  every  100  nanoseconds.  If  the 
controller  is  interfaced  to  a  VAX,  it  will  be  able 
to  receive  macro  instructions  from  the  VAX  as  fast 
as  the  VAX  can  issue  them. 

The  machine  allows  global  broadcast  from  the 
controller  to  all  cells,  an  activity  bit  set  by 
each  cell  for  its  response,  and  the  global  response 
from  the  array  of  cells  to  the  controller  in  terms 
of  a  count  or  some/none  functions.  In  particular, 
a  comparand  may  be  broadcast  from  central  control 
and  cells  whose  contents  fail  to  match  the 
broadcast  comparand  will  be  turned  off  so  that 
exact  match  to  comparand,  greater  than  (less  than) 
comparand,  maximum,  and  minimum  searches  may  be 
performed in parallel on all cells of the memory.
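
A toy software model may help fix the idea of broadcast, activity bits, and global response. The class and function names below are hypothetical, a Python list stands in for the array of cells, and the maximum search uses the classic bit-serial associative technique; the text does not state that the CAAPP performs its maximum search in exactly this way.

```python
class AssociativeArray:
    """Toy model: one value and one activity bit per cell."""

    def __init__(self, values):
        self.values = list(values)
        self.active = [True] * len(self.values)

    def reset(self):
        """Turn every cell back on."""
        self.active = [True] * len(self.values)

    def search(self, predicate):
        """Broadcast a test; cells that fail it are turned off."""
        self.active = [a and predicate(v)
                       for a, v in zip(self.active, self.values)]

    def some(self):
        """Some/none response: is any cell still active?"""
        return any(self.active)

    def count(self):
        """Response count: number of active cells."""
        return sum(self.active)


def maximum_search(array, n_bits):
    """Leave active only the cells holding the maximum value.

    Bits are examined from most to least significant. If some active
    cell has a 1 in the current bit, cells with a 0 there are turned off."""
    array.reset()
    for bit in reversed(range(n_bits)):
        has_one = [a and bool((v >> bit) & 1)
                   for a, v in zip(array.active, array.values)]
        if any(has_one):          # some/none feedback to the controller
            array.active = has_one


cells = AssociativeArray([12, 7, 200, 45, 200, 3])
cells.search(lambda v: v > 10)    # "greater than comparand" search
greater_than_10 = cells.count()
maximum_search(cells, n_bits=8)
maxima = [v for v, a in zip(cells.values, cells.active) if a]
print(greater_than_10, maxima)
```

Note that the maximum search needs one some/none response per bit, so its cost depends on the word length and not on the number of cells.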

The  individual  chips  we  are  designing  will 
contain  64  cells  in  an  8x8  array  in  a  45-pin 
package  (Figure  4,  Table  1).  Each  PC  board  will 
contain  64  chips  in  an  8  x  8  array  (64  cells  x  64 
cells)  and  the  memory  as  a  whole  will  contain  4 
such  cards  (expandable  to  64  in  an  8x8  array) 
giving  the  overall  128  cell  x  128  cell  memory 
design.  Through  a  clever  organization  of  the 
architecture  we  have  managed  to  reduce  the  number 
of  off-card  connections  to  only  146  lines  per  card 
(Table  2),  thereby  eliminating  what  has  been  a 
major  source  of  unreliability  in  other  parallel 
processors. 
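
The packaging hierarchy described above can be checked with a short sketch. The chip and board dimensions come from the text; the row-major tiling used by the hypothetical `locate` function is an assumption, since the text does not say how cell coordinates are assigned to chips and cards.

```python
CELLS_PER_CHIP_SIDE = 8    # 8 x 8 cells per chip (from the text)
CHIPS_PER_BOARD_SIDE = 8   # 8 x 8 chips per PC card (from the text)


def array_side(boards_per_side):
    """Number of cells along one side of the full array."""
    return boards_per_side * CHIPS_PER_BOARD_SIDE * CELLS_PER_CHIP_SIDE


def locate(row, col):
    """Map an array coordinate to (board, chip, cell) index pairs.

    ASSUMES boards, chips, and cells are tiled in simple row-major order."""
    board_side = CHIPS_PER_BOARD_SIDE * CELLS_PER_CHIP_SIDE   # 64 cells
    board = (row // board_side, col // board_side)
    chip = ((row % board_side) // CELLS_PER_CHIP_SIDE,
            (col % board_side) // CELLS_PER_CHIP_SIDE)
    cell = (row % CELLS_PER_CHIP_SIDE, col % CELLS_PER_CHIP_SIDE)
    return board, chip, cell


print(array_side(2), array_side(8))   # 2 x 2 cards and 8 x 8 cards
print(locate(127, 64))
```

Four cards arranged 2 x 2 give the 128 x 128 prototype, and 64 cards arranged 8 x 8 give the 512 x 512 array.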

Table 1. 64 Processing Element Chip Pin List

  Signal               Pins
  Instruction            25
  Communication           8
  Quadrant Enable         4
  Comparand In            1
  Some/None Out           1
  Read/Hold Count         1
  Shift/Wait Count        1
  Clock                   2
  Power/Ground            2
  Total                  45

Table 2. 64 Chip PC Card Connection List (per-signal line counts are not legible in the source; the text gives a total of 146 lines per card)

  North Communication
  South Communication
  East Communication
  West Communication
  Column Select
  Row Select
  Instruction
  Comparand In
  Some/None Out
  Read/Hold Count
  Shift/Wait Count
  Board Count Latch
  Board Count
  Board Count Select
  Clock
  Power/Ground

The  chips  and  the  PC  cards  are  designed  to  be 
independent  of  the  overall  size  of  the  memory 
array.  All  array  size  dependent  functions  are 
implemented  in  a  single  column  and  row  of  "edge 
cards"  that  lie  conceptually  to  the  left  and  below 
the  leftmost  column  and  the  bottom  row  of  the 
array.  Thus  we  will  be  developing  not  only  a  large 
parallel  processor,  but  also  a  set  of  building 
blocks  that  can  be  easily  assembled  to  make  other 
special  purpose  machines  tailored  for  specific 
applications. 

Images  are  loaded  into  the  CAAPP  in  a 
parallel/serial  scheme,  one  scan  line  at  a  time. 
This takes 1.64 milliseconds, or about 1/20th of
a  frame  time  for  a  16  bit  (color)  image.  In 
addition  to  the  array  movement  operations  and  the 
usual  content  addressable  functions,  three 
important  global  functions  are  included.  These 
are: (1) report whether any of the cells is a
"responder" (has 1 in its X register), (2) count the
number of "responders", and (3) find the "first"
responder.  Together  these  provide  the  key  to 
adaptive  processing  techniques.  For  example,  an 
image  enhancement  algorithm  could  adapt 
automatically  to  different  light  levels  in 
different  parts  of  the  image  in  order  to  extract 
the  same  amount  of  detail  from  all  parts  of  the 
image.
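
One way the response count could drive such adaptive processing is sketched below. This is an illustration only, not the authors' enhancement algorithm: the controller bisects on a broadcast threshold, using one global count per step, until a chosen fraction of cells respond. All function names are hypothetical, and a nested Python list stands in for the cell array.

```python
def count_responders(image, threshold):
    """Global count: cells whose value is >= the broadcast threshold."""
    return sum(1 for row in image for v in row if v >= threshold)


def first_responder(image, threshold):
    """Coordinates of the first responder in raster order, or None."""
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            if v >= threshold:
                return (r, c)
    return None


def adaptive_threshold(image, target_fraction, n_bits=8):
    """Largest threshold at which at least target_fraction of the cells
    still respond, found by bisection with one count per step."""
    total = sum(len(row) for row in image)
    lo, hi = 0, (1 << n_bits) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_responders(image, mid) >= target_fraction * total:
            lo = mid            # enough responders: try a higher threshold
        else:
            hi = mid - 1        # too few responders: lower the threshold
    return lo


dark = [[10, 20], [30, 40]]
bright = [[110, 120], [130, 140]]
print(adaptive_threshold(dark, 0.5), adaptive_threshold(bright, 0.5))
print(first_responder(dark, 30))
```

For 8 bit data the bisection needs only eight broadcast and count steps, whatever the size of the array, and the threshold follows the light level of each image region without the image leaving the array.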

Communication  Network 
for  64  Processing  Element  Chip 

3.2 THE PROCESSING ELEMENTS

Each  processing  element,  or  cell  (Figure  5)  is 
a  bit  serial  processor  consisting  of  a  bit  serial 
ALU,  128  bits  of  memory  (M),  local  (and  global) 
interconnection  hardware,  and  five  single  bit 
registers: 

-  X  The  primary  accumulator  bit,  which  is 
also  used  for  communications. 

-  Y  The  second  accumulator  bit. 

-  Z  The  carry  bit,  used  for  arithmetic 
operations. 

-  A  The  activity  bit,  used  for  enabling  and 
disabling  this  cell  on  any  given 
operation. 

-  B  The  secondary  activity  bit,  used  as  a 
temporary  storage  for  activity  "flags". 

Functional  Diagram  of  One  Processing  Element 

Figure 6 shows the basic micro-operations
performed by each cell. Instructions are of the
form: "select two sources, perform some function
on them, and store the result in some destination."
Instructions involving the 128 bit memory must use
the same location for read-modify-write operations.
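
To illustrate what bit serial arithmetic with a carry bit and an activity bit involves, the following sketch adds one memory field into another in every active cell. It is not instruction-accurate: each inner step here would take several of the micro-operations described above, and the dictionary layout and helper names are hypothetical.

```python
def make_cell(a, b, width, active=True):
    """Build one toy cell holding two fields, least significant bit first."""
    memory = ([(a >> i) & 1 for i in range(width)] +
              [(b >> i) & 1 for i in range(width)])
    return {"M": memory, "Z": 0, "A": active}


def read_field(cell, start, width):
    """Read a width-bit unsigned field starting at bit address start."""
    return sum(bit << i
               for i, bit in enumerate(cell["M"][start:start + width]))


def bit_serial_add(cells, a_start, b_start, width):
    """Add field A into field B in every active cell, one bit per step.

    Z is the carry bit and A is the activity bit, following the register
    names in the text. Overflow beyond width bits is left in Z."""
    for cell in cells:
        cell["Z"] = 0
    for i in range(width):              # one bit position per step
        for cell in cells:              # conceptually simultaneous
            if not cell["A"]:
                continue                # disabled cells do nothing
            x = cell["M"][a_start + i]
            y = cell["M"][b_start + i]
            z = cell["Z"]
            cell["M"][b_start + i] = x ^ y ^ z
            cell["Z"] = (x & y) | (z & (x ^ y))


cells = [make_cell(5, 9, 8),
         make_cell(200, 100, 8),
         make_cell(1, 1, 8, active=False)]
bit_serial_add(cells, a_start=0, b_start=8, width=8)
sums = [read_field(c, 8, 8) for c in cells]
print(sums)
```

The running time depends on the field width rather than on the number of cells, which is why the loss from bit serial operation can be offset by greater parallelism.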

It  is  interesting  to  note  that  this  machine 
has,  in  a  very  real  sense,  sixteen  thousand 
separate  parallel  processors.  As  an  SIMD  machine, 
it  can  readily  simulate  STARAN,  ILLIAC-IV,  or  any 
of  the  rectangular  array  systolic  machines  proposed 
by Kung [KUN82]. While there will, of course, be a
loss  in  speed  due  to  simulation,  most  of  this  loss 
is  attributable  to  the  bit  serial  nature  of  the 
arithmetic  and  logical  operations.  In  most  cases 
this  is  more  than  compensated  for  by  the  greater 
parallelism  of  our  design. 

3.3  IMAGE  PROCESSING  ALGORITHMS 

Table  3  lists  28  algorithms  of  differing 
complexity  which  we  have  examined  during  the  course 
of our design of the CAAPP. This is by no means an
exhaustive list; however, it does show the wide
variety  of  tasks  that  the  CAAPP  will  support. 
Table  3  indicates,  for  each  algorithm,  whether  or 
not  it  makes  use  of  intercell  communications,  the 
some/none  report  mechanism,  the  response  count 
mechanism,  broadcast  data,  and  local  cellular 
processing.  Of  course,  all  of  the  algorithms  use 
the  instruction  broadcast  mechanism. 


Algorithms  Used  During  CAAPP  Evaluation 
Table 3


What  Table  3  actually  points  out  is  which  of 
the  algorithms  use  purely  the  associative  aspect  of 
the  processor,  which  algorithms  use  purely  the 
square  grid  aspect,  and  which  algorithms  use  both 
aspects  of  the  processor.  Purely  associative 
algorithms  are  indicated  in  the  Table  by  a  "no"  in 
the  Intercell  Communications  column.  For  those 
algorithms  with  a  "yes"  in  the  Intercell 
Communications column, if the corresponding entries
in both the Some/None column and the Fast Count
column are "no", then the algorithm uses purely the
square  grid  aspect  of  the  CAAPP.  Those  algorithms 
with  a  "yes"  in  the  Intercell  Communications  column 
and  a  "yes"  in  either  or  both  of  the  Some/None  and 
Fast  Count  columns  use  both  the  associative  and  the 
square  grid  aspects  of  the  CAAPP. 
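
The reading rule for Table 3 given above amounts to a three-way classification, sketched here. The function name and the example entries are hypothetical; they are not rows taken from Table 3.

```python
def classify(intercell, some_none, fast_count):
    """Classify an algorithm from its Table 3 column entries.

    Each argument is True for a "yes" entry and False for a "no" entry."""
    if not intercell:
        return "associative"
    if some_none or fast_count:
        return "both"
    return "square grid"


# Hypothetical entries, for illustration only.
print(classify(intercell=False, some_none=True, fast_count=True))
print(classify(intercell=True, some_none=False, fast_count=False))
print(classify(intercell=True, some_none=True, fast_count=False))
```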


4.0 ICONIC TO SYMBOLIC PROCESSING

A basic step in the functioning of autonomous,
general purpose vision systems is the association
of symbolic descriptions with the results of
segmentation, region, and edge extraction
processes. Such a representation acts as a data
base which is accessed by recognition and grouping
processes to determine the relations between
different image structures. The extraction of such
a representation has been characterized as the
iconic (or image) to symbolic mapping problem and
is a critical issue for evaluating proposed
architectures for real time image processing.

There are several examples of such spatially
tagged, intermediate level symbolic
representations: the primal sketch of Marr
[MAR82], the curvature primal sketch of Asada and
Brady [ASA84], the RSV structure of the VISIONS
system [HAN78a,b], the patchery data structure of
Ohta [OHT80], Haralick's [LAF82] topographic
classification of digital image intensity surfaces,
and several others. These representations are
organized to be accessed by various processes for
the recognition of world objects. Since world
knowledge and control strategies in vision systems
are expressed in terms of symbolic, relational
structures, these representations act as a
necessary level of interface to the results of low
level image operations.

We have begun investigating iconic to symbolic
processing on the CAAPP motivated by its
combination of features from both associative
processors and square grid array processors as
discussed above. This combination gives it the
capability of turning raw images into symbolic
descriptions in the same memory locations where the
segmentations from low level processes are stored.
In this section we will present one example of the
many possible iconic to symbolic transformations of
image data possible in the CAAPP.

The particular representation we have been
developing is a version of the RSV representation
of the VISIONS image understanding system
[HAN78a,b] in which the basic entities are Regions
(connected sets of pixels); Segments (portions of
the contours surrounding regions); and Vertices
(selected points along contours). Each of these
entities has specific attributes (such as area and
extent for regions; length and orientation for
segments). There are also specific relations
between these entities (such as adjacencies between
regions and edges). Associating this
representation with an image consists of the
following stages of processing:

1) An image is loaded into the CAAPP. This
involves some portion (18 bits assuming a 512x512
image) of the 128 bit memory of each CAAPP cell.

2) Each cell has its coordinates in the CAAPP array
computed and stored in another 18 bit portion of
its 128 bit memory (required for a 512x512 image).

3)  A  segmentation  procedure  is  applied.  We  have 
experimented  with  simple  segmentation  schemes  based 
upon  differences  of  Gaussian  convolutions  followed 
by  thresholding  to  extract  zero-crossings  [MAR82] 
and  histogram-guided  segmentation  techniques.  Both 
of  these  procedures  are  very  rapid  in  the  CAAPP  and 
can  be  made  selectively  sensitive  to  different 
ranges  of  contrast  and  spatial  frequency 
information.  The  segmentation  results  are  stored 
in  another  portion  of  each  CAAPP  cell  called  the 
region  property  field.  The  size  of  the  property 
field  depends  on  the  segmentation  technique.  For 
binary  regions  resulting  from  a  single  threshold 
only  1  bit  is  required.  For  general  segmentation 
techniques,  the  requirements  depend  on  the  number 
of  labels  associated  with  ranges  of  the  defining 
properties. 

4)  Points  along  the  boundaries  of  regions  are 
determined.  This  is  a  local  computation  over  the 
neighborhood  of  a  cell  to  determine  if  the  values 
in its region property field are different from
those  in  surrounding  cells.  At  this  stage  local 
edge  connectivity  and  the  number  of  adjacent 
regions  at  a  point  can  be  determined.  The  number 
of  adjacent  regions  at  a  point  is  called  the  local 
connectivity  type.  For  a  square  pixel  grid,  the 
local  connectivity  values  can  range  from  zero  to 
four. 
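
Stage 4 can be sketched in software as follows. The interpretation of the local connectivity type as the number of distinct differing labels among the four North, South, East, and West neighbors is an assumption made to match the stated range of zero to four. The function name is hypothetical, and the loops over cells stand in for what would be a simultaneous computation in every cell.

```python
def boundary_and_connectivity(labels):
    """For each cell, report whether it is a boundary point and how many
    distinct other region labels appear among its four neighbors."""
    rows, cols = len(labels), len(labels[0])
    boundary = [[False] * cols for _ in range(rows)]
    connectivity = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            others = set()
            for dr, dc in ((-1, 0), (1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and labels[nr][nc] != labels[r][c]:
                    others.add(labels[nr][nc])
            connectivity[r][c] = len(others)
            boundary[r][c] = bool(others)
    return boundary, connectivity


region_property = [[0, 0, 1],
                   [0, 0, 1],
                   [2, 2, 1]]
boundary, connectivity = boundary_and_connectivity(region_property)
print(connectivity)
```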

5)  An  operator  for  extracting  points  of  significant 
curvature  is  applied  to  the  extracted  boundary 
points.  This  is  a  local  computation  over  the 
immediate  neighborhood  of  a  boundary  point  and 
finds  points  along  a  contour  of  distinctive 
curvature.  These  points  are  tagged  as  being 
vertices (the V of RSV) and will be treated as the
endpoints  of  boundary  segments. 

6)  The  vertices  extracted  in  step  5  are  then 
treated  as  seeds  in  a  contour  message  passing 
process.  The  message  passing  involves  moving  the 
coordinates  of  each  vertex  along  the  boundaries 
that  intersect  with  it.  This  propagation  continues 
along  a  boundary  until  another  interesting  point  is 
encountered.  Collisions  of  these  vertex  labels  at 
the  mid-points  of  segments  are  also  tagged.  Since 
all  the  message  passing  is  occurring  in  unison,  the 
number  of  steps  is  maintained  as  a  global  count  and 
broadcast  to  the  particular  cells  at  which 
collisions  have  occurred.  This  is  a  measure  of 
contour  length. 


7)  After  stage  6,  each  vertex  point  contains  its 
coordinates  and  those  of  adjacent  vertices  to  which 
it  is  connected  along  some  boundary.  Additionally, 
each  tagged  midpoint  contains  the  coordinates  of 
the  endpoints  of  its  associated  segment.  The 
following  things  are  then  computed  in  parallel  from 
these  coordinate  values  at  the  vertices  and 
midpoints:  the  distance  between  the  endpoints,  the 
slope  of  the  line,  the  deviation  of  the  midpoint 
from  the  line  determined  by  the  two  endpoints,  and 
the  difference  between  the  number  of  steps  and  the 
summed  absolute  difference  of  the  components  of  the 
segment  endpoints  (these  correspond  to  measures  of 
the  goodness  of  the  linear  fit).  These  values  are 
stored  in  unused  cells  near  the  extracted  vertices. 
These  cells  are  tagged  as  being  either  midpoint  or 
segment cells (the S in RSV).

8)  Independent  of  the  boundary  processing  in  steps 
5-7,  a  diffusion  process  is  also  applied  over  the 
values  in  the  region  property  fields  to  determine 
connected  components.  The  component  labels  are 
determined  from  the  coordinates  of  the  cells  in  the 
region.  Collisions  between  adjacent  cells  having 
the  same  values  in  the  region  property  field  are 
resolved  by  letting  the  one  with  the  least  row, col 
coordinate  be  the  dominating  label  in  the  region 
label  field.  The  particular  region  cell  having  the 
least  row,  col  component  is  called  the  Region  Cell 
and  is  the  CAAPP  location  where  information  about 
the corresponding region is stored (the R in RSV).

9)  We  can  then  step  sequentially  through  the 
extracted image structures using the Find First
Responder  operation  of  the  CAAPP.  This  operation 
selects  one  of  the  cells  that  is  currently 
responding  to  an  associative  query,  turning  off  all 
of  the  other  responders.  This  allows  processing  to 
take  place  on  that  single  cell  without  affecting 
any  of  the  other  responders.  The  cell  that  is 
selected  is  the  one  that  would  first  be  encountered 
if  the  CAAPP  array  were  to  be  scanned  in  normal 
raster  order.  For  example,  we  can  proceed 
sequentially  through  each  of  the  region  cells, 
extract  the  label  of  the  corresponding  region, 
broadcast  it,  find  the  responding  region  cells,  and 
then  compute  simple  region  properties  such  as  area, 
perimeter  length,  minimum  bounding  rectangle,  and 
so  forth  for  the  corresponding  region.  These 
region  values  are  then  stored  at  (or  near)  the 
corresponding  region  cell.  They  can  also  be 
broadcast  and  then  stored  in  locations  associated 
with  other  image  structures.  This  distributes  the 
information  and  makes  parallel  processing  of 
relational  queries  possible.  This  can  be  done  for 
each  region  or  only  regions  having  certain 
properties  such  as  particular  size,  shape,  or  image 
position. 
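The sequential stepping in step 9 can be sketched in software. This is only an illustration of the selection rule described above (the first responder in raster order is kept, processed, and then dropped); the function names and the list-of-lists array are assumptions made for the example, not part of the CAAPP instruction set.

```python
def find_first_responder(responders):
    """Return (row, col) of the first responding cell in raster order, or None."""
    for r, row in enumerate(responders):
        for c, flag in enumerate(row):
            if flag:
                return (r, c)
    return None


def visit_responders(responders, action):
    """Process responders one at a time, dropping each after it is handled."""
    pending = [row[:] for row in responders]  # copy so the caller's tags survive
    visited = []
    while True:
        cell = find_first_responder(pending)
        if cell is None:
            break
        action(cell)                        # work on this single cell only
        visited.append(cell)
        pending[cell[0]][cell[1]] = False   # turn it off and query again
    return visited


# Hypothetical 3 x 3 array with three responding region cells.
tags = [
    [False, True, False],
    [True, False, False],
    [False, False, True],
]
order = visit_responders(tags, lambda cell: None)
```

In the scheme described above, the action for each selected region cell would be to broadcast its region label and compute properties over the cells that respond.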

This  processing  results  in  attribute  lists  for 
particular  regions,  segments,  vertices,  and  labels 
associated  with  interior  region  and  boundary 
points.  The  global  broadcast  and  associative 
properties  of  the  CAAPP  then  allow  us  to  make 
queries  to  select  particular  image  structures,  to 
tag  them,  and  then  determine  relations  between 
them. 


We now discuss in some detail how these symbolic
labels are associated with the raw image, and the
relational processing that can be applied to them
in the CAAPP.

4.1  LOCAL  EDGE  AND  VERTEX  REPRESENTATION 

Images  are  composed  of  pixels  which  are 
separated  by  edge-links  that  meet  at  a  point. 
Associated  with  each  CAAPP  processor  are  the  values 
at  the  corresponding  pixel  in  the  image,  the  point, 
and  the  four  edge  links  incident  with  the  point 
(Figure  7).  The  redundant  storage  of  local  edge 
link connectivity simplifies message passing along
contours. 


Figure  7 


The  128  bits  in  each  CAAPP  cell  are  initially 
decomposed  into  fields  indicated  in  Table  4.  Each 
processor  has  its  coordinates  in  the  image  array 
stored  with  it.  The  coordinate  grid  is  created  in 
the  CAAPP  processing  array  by  a  series  of 
broadcasts  from  the  controller  using  different 
patterns  of  row  and  column  enable  signals,  followed 
by  a  series  of  response  shifts  using  the  on-chip 
communication  network.  Loading  the  complete  512  by 
512  coordinate  grid  takes  12.7  microseconds.  The 
storage  of  these  coordinates  is  necessary  for 
computing  geometric  relations  among  sets  of 
selected  pixels  and  is  a  source  of  unique  labels 
for  connected-components  analysis  by  having  the 
label  associated  with  a  region  be  the  coordinates 
of its upper-most, left-most pixel. Note that the
required number of bits for this attribute is a
function of image size. The raw image values
correspond  to  the  actual  image  which  is  being 
operated  upon.  Again,  fewer  bits  than  18  can  be 
used,  or  the  raw  image  can  be  removed  from  storage 
once  a  segmentation  has  been  obtained  and 
processing  is  restricted  to  it.  The  property  field 
corresponds  to  the  value  upon  which  the 
segmentation  is  performed,  such  as  the  response  to 
some  texture  measure  or  the  domain  of  some 
histogram.  The  region  label  field  is  used  for  the 
connected  components  analysis  and  must  be  of  the 
same  dimension  as  the  coordinates  of  the  image. 

FIELD                                      BITS

pixel location (row, col coordinates)       18
raw image value (6 bit R,G,B)               18
region property                             18
region label                                18
local connectivity                           4
R, S, V or Midpoint                          2
working area                                80

Table 4


The  working  area  is  used  in  several  different 
ways.  For  example,  our  implementation  of  contour 
message  passing  requires  44  bits  (8  for  storing 
local  edge  connectivity  traversals  and  36  for  two 
sets  of  pixel  coordinates).  The  interest  operator, 
depending  upon  resolution  and  the  number  of 
iterations,  can  require  from  24  to  48  bits. 
Several  of  the  queries  require  multiple  tag  bits  to 
be  set  in  particular  combinations.  Note  that  all 
these  fields  are  not  needed  simultaneously.  Many 
are  dependent  upon  the  segmentation  being  performed 
and  need  not  be  used  for  anything  until  the 
segmentation  has  been  obtained.  The  contour 
message  passing  occurs  after  the  interesting  points 
have been extracted. However, more complex
processing clearly needs a larger image memory if
memory swapping between the host and the CAAPP
array is to be avoided.

4.2  SEGMENTATION 


As  mentioned  previously,  we  have  explored  two 
types  of  segmentation  procedures  on  the  CAAPP: 
Histogram  Guided  Segmentation  and  Zero-crossing 
extraction using DOGs (Difference of Gaussians).

Histogram based segmentation [OHL75, OHL78,
NAG79, NAG82, KOH83, REY84] on the CAAPP involves
forming  a  histogram,  determining  what  region  labels 
to  associate  with  the  different  ranges  in  the 
histogram  based  upon  its  peaks  and  valleys,  and 
then  broadcasting  the  determined  labels  for  the 
particular  ranges  of  pixel  values.  Forming  the 
histogram  uses  the  Select  Less  Than  and  Count 
Responders  micro  sub-routines  to  select  ranges  of 
values  (buckets)  starting  with  the  lowest  range  and 
working  up  to  the  highest  range.  For  example,  if 
the  range  of  each  bucket  is  taken  to  be  the  maximum 
range  divided  by  the  number  of  buckets,  then  the 
time  for  the  algorithm  is  dominated  by  the  time  to 
perform  the  count  responders  operation  for  each 
bucket. The time is: (Number of Buckets) * 1.6 +
0.9 microseconds; a 256 bucket 8-bit histogram
would take 410.5 microseconds. The histogram is
collected in the central control, which collects
the counts for each bucket. The resulting histogram
is a global histogram for the whole image. Labeling
the histogram takes place off the CAAPP in the host
processor.
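The histogram procedure and its timing estimate can be illustrated with a short sketch. The two helper functions stand in for the Select Less Than and Count Responders micro sub-routines; taking each bucket count as the difference of successive cumulative counts is an assumption about how the controller combines them, and all names are hypothetical.

```python
def select_less_than(pixels, bound):
    """Associative query: one response flag per cell whose value is below bound."""
    return [value < bound for value in pixels]


def count_responders(flags):
    """Global count of responding cells."""
    return sum(flags)


def histogram(pixels, n_buckets, value_range):
    """Build a global histogram, working from the lowest bucket upward."""
    width = value_range // n_buckets          # equal-width buckets
    counts = []
    below_previous = 0
    for b in range(1, n_buckets + 1):
        below = count_responders(select_less_than(pixels, b * width))
        counts.append(below - below_previous)  # cells that fall in this bucket
        below_previous = below
    return counts


def histogram_time_us(n_buckets):
    """Timing estimate quoted in the text, in microseconds."""
    return n_buckets * 1.6 + 0.9


counts = histogram([0, 1, 1, 3, 7, 7, 7], n_buckets=4, value_range=8)
estimate = histogram_time_us(256)
```

For the 256 bucket case the estimate evaluates to the 410.5 microseconds quoted above.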


Experience  has  shown  that  global  histograms  are 
ineffective  in  images  with  complex  information.  We 
have  not  explored  all  the  requirements  for 
computing  localized  histograms  and  their  subsequent 
remerging  in  the  CAAPP,  nor  the  inherently 
sequential  process  of  recursive  resegmentation. 
Both  of  these  approaches  should  be  significantly 
enhanced  by  incorporating  an  array  of  localized 
controllers  in  the  CAAPP  design. 

For  the  initial  experiments  described  here,  we 
have  used  the  segmentations  which  result  from 
thresholding the difference of Gaussians. Such
segmentations  are  binary  images  and  allow  for 
simple  storage  of  boundaries  since  there  are  only 
two  types  of  regions  (greater  or  less  than  the  zero 
threshold).  For  such  binary  segmentations,  there 
are  no  type  1  or  3  connectivities  and  those  of  type 
4  can  be  removed  by  removing  region  labels  around 
type  4  connectivity  cells.  Thus,  it  is  possible  to 
have  single  boundaries  with  only  type  2 
connectivities.  The  modifications  for  other  types 
of  segmentations  require  redundant  storage  for 
boundaries  of  adjacent  regions.  This  involves  no 
new  processes  in  our  procedures,  but  does  require 
the  allocation  of  additional  storage  in  CAAPP 
memory  cells  so  that  the  boundary  of  each  region  is 
uniquely  represented.  A  simple  technique  for  this 
is  to  remove  the  outermost  pixels  from  each  region 
to  give  each  region  a  unique  boundary  of  type  2 
connectivity only. The type 2, 3, and 4 vertices
between these boundaries are then used to store
region adjacency information. This does, however,
destroy regions of single pixel widths.

There  are  three  basic  steps  in  Zero-crossing 
extraction using a difference of Gaussians:

1) Convolve with Gaussians of different widths and
store  results  separately 

2)  Subtract  Results 

3)  Threshold  at  zero  to  yield  contour  and  binary 
regions 

This  is  performed  on  the  CAAPP  by  taking 
advantage  of  the  fact  that  for  Gaussians  and  other 
smoothing  operations,  convolutions  with  large  masks 
are  equivalent  to  multiple  convolutions  with 
smaller  masks.  Thus,  processing  involves 
convolving  an  image  with  a  mask  some  number  of 
times  and  storing  the  result,  and  then  continuing 
the  convolution  further  and  subtracting  the  result 
from  the  one  previously  stored.  Among  the  masks 
that  could  be  used  are  those  in  Table  5. 
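The three steps can be sketched as repeated smoothing with a small mask. This is a minimal serial illustration, assuming a 3 x 3 uniform mask scaled by 1/9 and replicated border pixels; neither the border rule nor the function names come from the source.

```python
def smooth_once(img):
    """One pass of a 3 x 3 uniform mask scaled by 1/9 (borders replicated)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    total += img[rr][cc]
            out[r][c] = total / 9.0
    return out


def smooth(img, passes):
    """Repeated small-mask convolution, standing in for one large mask."""
    for _ in range(passes):
        img = smooth_once(img)
    return img


def dog_segmentation(img, narrow_passes, wide_passes):
    """Binary labels from thresholding (narrow - wide) at zero."""
    narrow = smooth(img, narrow_passes)                  # stored result
    wide = smooth(narrow, wide_passes - narrow_passes)   # continue smoothing
    return [[1 if n - w > 0 else 0 for n, w in zip(n_row, w_row)]
            for n_row, w_row in zip(narrow, wide)]


# Hypothetical 7 x 7 test image with one bright pixel in the centre.
image = [[0] * 7 for _ in range(7)]
image[3][3] = 9
labels = dog_segmentation(image, narrow_passes=0, wide_passes=1)
```

With `narrow_passes=0` the stored result is the raw image itself, which corresponds to differencing an image against a smoothed version of it.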

On the CAAPP convolution is a "macro" level
operation programmed in terms of the basic "micro"
CAAPP operations. A discrete, two dimensional,
convolution is based on a mask of multipliers that
each cell applies to its local neighborhood,
forming the sum of the pairwise products of the
cell's neighbors with their corresponding mask
values. Typically the sum is then scaled in some
manner, and the resulting value is used to update
the value in the cell. The algorithm for the
convolution can be described as the actions of a
single cell with the understanding that each action
is performed simultaneously by all of the cells.

Uniform Weighting

1  1  1
1  1  1
1  1  1

Burt's Kernel

.0025  .0125  .02  .0125  .0025
.0125  .0625  .1   .0625  .0125
.02    .1     .16  .1     .02
.0125  .0625  .1   .0625  .0125
.0025  .0125  .02  .0125  .0025

Table 5
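As a check on the Burt's kernel entries in Table 5, the 5 x 5 mask is the outer product of the one dimensional weights .05, .25, .4, .25, .05. The sketch below rebuilds the mask from those weights; the rounding to four decimals is only there to match the printed table, and the variable names are illustrative.

```python
# One dimensional generating weights; their outer product gives the 5 x 5 mask.
weights = [0.05, 0.25, 0.4, 0.25, 0.05]

# Round to four decimals so the entries compare cleanly with the printed table.
kernel = [[round(a * b, 4) for b in weights] for a in weights]

for row in kernel:
    print("  ".join(f"{value:.4f}" for value in row))
```

Because the weights sum to one, the mask also sums to one, so no separate scaling step is needed after applying it.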

For  the  CAAPP  each  cell  distributes  its  own 
data  to  every  cell  in  the  neighborhood.  Because 
every  other  cell  is  also  doing  this,  the  end  result 
is  that  the  central  cell  (and  hence  all  cells)  gets 
the  data  it  needs  from  all  of  the  cells  in  the 
neighborhood.  The  data  distribution  path  is  a 
rectangular  spiral  out  from  the  center  cell.  It 
should  be  noted  that  the  time  required  to  perform  a 
convolution  using  the  CAAPP  is  independent  of  the 
size  of  the  image  (assuming  the  image  is  no  larger 
than  the  array)  and  only  dependent  upon  the  area  of 
the  convolution  mask.  Since  the  CAAPP  does 
cell-level  arithmetic  bit-serially,  the  size  of  the 
data  values  also  affects  the  speed  of  the 
algorithm. 

A  worst  case  estimate  of  the  time  required  for 
such  convolutions  can  be  obtained  from  the  formula: 

T = P * (0.8 * N + 0.2 * M + 0.1) +
    0.3 * M * (N**2 * P + N + 1)


where  T  is  the  time  in  microseconds,  N  is  the 
number  of  bits  in  a  pixel  value,  M  is  the  number  of 
bits  in  a  mask  value  and  P  is  the  number  of  pixels 
in  the  mask  area.  Under  normal  circumstances  T 
will  be  about  half  of  the  value  obtained  from  the 
formula.  For  8  bit  pixel  values,  this  would  give 
times  of: 


Mask Size      Time (milliseconds)

3 x 3          0.7
5 x 5          2.1
7 x 7          4.0
11 x 11        9.9
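The worst case formula can be evaluated directly. The sketch below is a plain transcription of it; the choice of M = 8 bits for the mask values is an assumption made for the example, since the text only fixes N = 8.

```python
def worst_case_convolution_time_us(p, n, m):
    """Worst case convolution time in microseconds.

    p: number of pixels in the mask area
    n: number of bits in a pixel value
    m: number of bits in a mask value
    """
    return p * (0.8 * n + 0.2 * m + 0.1) + 0.3 * m * (n ** 2 * p + n + 1)


# 3 x 3 mask, 8 bit pixels, and (as an assumption) 8 bit mask values.
worst = worst_case_convolution_time_us(p=9, n=8, m=8)
typical_ms = worst / 2 / 1000.0   # "about half" of the worst case, in ms
```

Under that assumption the 3 x 3 case comes to roughly 0.74 milliseconds, close to the 0.7 listed in the table.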


As  an  example,  the  binary  segmentation 
resulting  from  the  difference  between  the  image  in 
Figure  8  with  the  smoothed  image  derived  from  it 
after  eight  successive  convolutions  with  Burt's 
kernel [BUR82] is shown in Figure 9.


Figure  8 


Figure  9 

4.3  LOCAL  EDGE  AND  VERTEX  PROCESSING 


The  next  stage  of  processing  determines  which 
pixels  are  adjacent  to  boundaries  and  what  the 
local  edge  connectivity  is.  This  operation  is 
based  upon  the  simple  comparisons  in  the  following 
four  steps  applied  to  all  cells  (described  with 
respect  to  local  connectivity  of  cell  A  in  Figure 
10):

Figure 10

if (region_property_field_at_A .ne.
    region_property_field_at_B) =>
    Edge_bit_tag(North) of A = 1 else 0

if (region_property_field_at_A .ne.
    region_property_field_at_D) =>
    Edge_bit_tag(West) of A = 1 else 0

Edge_bit_tag(East) of A =
    Edge_bit_tag(West) of B

Edge_bit_tag(South) of A =
    Edge_bit_tag(North) of D

The  local  connectivity  type  is  the  number  of  set 
edge  bit  tags. 
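The counting of set edge bit tags can be sketched as follows. Since Figure 10 is not reproduced here, the geometry is an assumption: the point is taken to be the corner shared by four pixels, and each of its four edge links is set when the two pixels it separates have different region property values. The function name and the dictionary layout are illustrative only.

```python
def local_connectivity(labels, r, c):
    """Edge bits and connectivity type at the point shared by pixels
    (r-1, c-1), (r-1, c), (r, c-1) and (r, c)."""
    nw, ne = labels[r - 1][c - 1], labels[r - 1][c]
    sw, se = labels[r][c - 1], labels[r][c]
    edges = {
        "N": int(nw != ne),   # link between the two upper pixels
        "E": int(ne != se),   # link between the two right-hand pixels
        "S": int(sw != se),   # link between the two lower pixels
        "W": int(nw != sw),   # link between the two left-hand pixels
    }
    return edges, sum(edges.values())


straight = local_connectivity([[0, 0], [1, 1]], 1, 1)   # horizontal boundary
crossing = local_connectivity([[0, 1], [1, 0]], 1, 1)   # checkerboard corner
interior = local_connectivity([[1, 1], [1, 1]], 1, 1)   # no boundary
```

Under this reading a binary segmentation can only produce types 0, 2, and 4, which agrees with the earlier remark that such segmentations have no type 1 or 3 connectivities.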

4.4  INTERESTING POINT EXTRACTION


The  next  processing  stage  extracts 
"interesting"  points  along  a  boundary.  These  are 
generally  points  of  high  or  changing  curvature. 
The  procedure  we  utilize  on  the  CAAPP  is  tunable 
for  different  ranges  of  curvature.  It  also 
involves  only  local  operations  over  a  boundary 
point  and  its  immediate  neighbors  to  allow  for  very 
rapid  processing. 

Processing  begins  by  associating  an  orientation 
vector  with  each  vertex  along  the  boundary  using 
the  rules  indicated  in  Figure  11  at  the  four 
different  boundary  orientations  possible  on  a 
square grid. Once an orientation vector has been
associated with each point along the boundary, each
point re-expresses its orientation vector by
averaging the row and column components of its
vector with those of its immediate neighbors along
the boundary.
The  number  of  iterations  of  this  averaging  process 
corresponds  to  weighted  evaluation  of  curvature 
over  different  neighborhood  sizes  along  the 
boundary. 


Interesting  points  are  extracted  where 
significant  changes  and  variations  in  orientation 
occur.  This  is  measured  by  the  sum  of  the  absolute 
differences  between  the  row  and  column  components 
of  the  orientation  vectors  at  the  two  neighboring 
boundary  points.  The  interesting  points  are  the 
local  maxima  in  this  measure  which  also  exceed  some 
small  threshold. 
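The averaging and local maximum test can be sketched on a single closed contour. Several details are assumptions made for the example: the contour is given as an ordered list of (row, column) orientation vectors, each averaging pass takes the equal-weight mean of a point and its two neighbors, and ties between equal neighboring maxima are broken toward the later point. Exact fractions are used so that ties compare reliably.

```python
from fractions import Fraction


def smooth_orientations(vectors, iterations):
    """Average each vector with its two contour neighbors, repeatedly."""
    n = len(vectors)
    vecs = [(Fraction(r), Fraction(c)) for r, c in vectors]
    for _ in range(iterations):
        vecs = [
            tuple((vecs[i - 1][k] + vecs[i][k] + vecs[(i + 1) % n][k]) / 3
                  for k in (0, 1))
            for i in range(n)
        ]
    return vecs


def orientation_change(vecs):
    """Sum of absolute component differences between the two neighbors."""
    n = len(vecs)
    return [
        abs(vecs[i - 1][0] - vecs[(i + 1) % n][0])
        + abs(vecs[i - 1][1] - vecs[(i + 1) % n][1])
        for i in range(n)
    ]


def interesting_points(vectors, iterations, threshold):
    """Indices that are local maxima of the measure and exceed the threshold."""
    measure = orientation_change(smooth_orientations(vectors, iterations))
    n = len(measure)
    return [
        i for i in range(n)
        if measure[i] > threshold
        and measure[i] >= measure[i - 1]
        and measure[i] > measure[(i + 1) % n]
    ]


# Hypothetical square contour: four sides of five points each.
square = [(0, 1)] * 5 + [(1, 0)] * 5 + [(0, -1)] * 5 + [(-1, 0)] * 5
corners = interesting_points(square, iterations=1, threshold=0.5)
```

More averaging passes widen the neighborhood over which curvature is evaluated, which is the tuning behavior described above.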

The  determination  of  the  orientation  variance 
measure  can  also  be  performed  with  respect  to  a 
particular  direction  along  a  curve.  In  this  case 
it  is  necessary  to  assign  an  orientation  (clockwise 
or  counter-clockwise)  to  points  along  the  boundary 
of the region. One way to do this is to select
some point, choose a direction, and walk around
the curve until the point is encountered again.
This  is  a  serial  and  time  consuming  operation.  It 
is  possible  to  determine  boundary  orientation 
through  a  completely  local  procedure  since  each 
point  knows  its  direction  relative  to  the  interior 
of  the  region  and  the  global  North,  East,  West, 
South  directions  on  the  CAAPP.  To  do  this,  each 
initial  orientation  vector  at  a  boundary  point 
samples  the  edge  link  connectivity  moving  in  the 
N,E,S,W  directions  (clockwise)  until  a  set  one  is 
encountered.  This  assigns  a  unique  neighbor  to 
each  boundary  point  and  thus  orients  the  curve  (see 
Figure  12).  This  direction  can  be  stored  to 
simplify  further  message  passing  and  walking  along 
a contour. Additionally, given an oriented
curve,  the  direction  of  curvature  can  be  determined 
by  projecting  the  difference  vector  between 
adjacent  orientation  vectors  onto  the  direction  of 
boundary  orientation. 


Figure  11 
Figure 12


Figures 13a,b,c show the positions of the
interesting points with respect to the contours from
Figure 9 after 25, 50, and 125 iterations of the
averaging procedure. The average orientation
vector length rapidly decreases for very small
regions. Figures 14a,b,c show the linear segments
between the extracted points for the corresponding
number of iterations.

Figures 15a,b,c show the positions of the
interesting points from the contour segments along
what is basically the house roof for the different
iterations. Figures 16a,b,c show the successive
values of the measure of difference between the
orientation vector neighborhoods along this
contour and Figures 17a,b,c show the linear
segments between these points. The interpolated
smoothed curves derived from the orientation
vectors are shown in Figures 18a,b,c.

4.5  SEGMENT  EXTRACTION  VIA  MESSAGE  PASSING 


Segments  are  then  extracted  between  interesting 
points.  This  is  performed  by  having  each  vertex 
send  out  its  coordinates  along  its  incident 
boundaries.  These  propagate  until  another 
interesting  point  is  encountered.  The  number  of 
steps  is  maintained  by  a  global  step  count  for  a 
measure of contour length. Finally, when each
interesting point has been collided with an
appropriate number of times (as determined by its
local connectivity), attributes such as slope,
distance,  and  linear  deviation  are  computed  in 
parallel  for  each  line  segment.  For  endpoints, 
linear  deviation  is  the  difference  between  the 
number  of  steps  between  interesting  points  and  the 
sum  of  the  component-wise  absolute  valued 
difference  of  the  endpoints. 


Messages  between  interesting  points  are  passed  in 
parallel  for  all  points  along  the  perimeter  of  each 
region. 

1) The operation begins by copying the coordinates
of each interesting point into two (initially
redundant) fields in the processing elements
(P.E.'s) which hold such points. We call these
fields message-1 and message-2, and we call the
coordinate data being sent messages. Those P.E.'s
also set flags in a bit vector (D-Out) representing
the four possible directions in which the point
could have arcs. There must be exactly two such
arcs. This is true at all points, not just the
"interesting" ones, since we have already
eliminated all but type 2 vertices.

2) For all P.E.'s with any D-Out bits set we send
out single bit flags to the two neighbors on the
arcs adjacent to them. This operation is done in
parallel, but once each for the four directions
with a test for each of four possible D-Out bits.

3) For all P.E.'s on an edge, we check to see if we
have bits set on any of our neighbors adjacent to
us, on arcs which define the edge of our region.
If we do, we set the corresponding (D-In) flags.
Again every P.E. will "look" in each of the four
directions. There will be none, one, or two bits
set. If two D-In bits are set, we have a
"collision", which means this vertex is at the
midpoint between two interesting points; we record
that fact in a separate flag in the P.E. memory.

4)  For  each  of  the  "n"  bits  which  represent  the 
messages  being  sent,  and  for  each  of  the  four 
directions  (North,  South,  East,  West),  we  perform 
the  following  operations,  indexed  by  "i": 

-  If  the  D-Out  bit  for  this  direction  is 
set,  and  there  is  only  one  D-out  bit  set, 
pick  up  the  ith  bit  of  the  message  from 
memory  field  message-1  and  send  it  out. 
If  this  is  the  second  of  two  D-Out  bits 
set  then  use  the  message-2  field. 

-  If  the  D-In  bit  is  set  for  the 
corresponding  direction  (South  for  North 
etc.)  and  there  is  only  one  D-In  bit  set, 
pick  up  the  ith  bit  of  the  message  and 
store it in the memory field message-1 of the
P.E. If this is the second D-In bit set
store  the  message  bit  in  message-2. 


5)  After  all  "n"  bits  of  the  current  messages  have 
been sent, we map D-In bits to D-Out bits such that
messages  that  came  in  from  the  North  go  out  to  the 
South.  We  do  this  for  all  vertices  that  are  not 
interesting  points.  Messages  will  stop  traveling 
at  interesting  points. 

Figures 15a, 15b, 15c, 17c, 18c

6) If there are no vertices with D-Out bits set
then  all  messages  have  reached  the  end  of  their 
travel  and  the  algorithm  terminates. 
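The overall behavior of steps 1 through 6 can be sketched at the cell level. This simulation is a simplification and not the bit-serial procedure itself: the contour is modeled as a closed ring of cells indexed 0..n-1, each interesting point's index stands in for its coordinates, and whole messages move one cell per step. All names are hypothetical.

```python
def pass_messages(n_cells, interesting):
    """Propagate vertex labels around a closed contour of n_cells cells.

    Returns (neighbors, midpoints):
      neighbors[v]: set of (origin, steps) pairs that arrived at vertex v
      midpoints[cell]: ((origin_a, origin_b), steps) where two labels collided
    """
    # Each message is (position, direction, origin); seeds send both ways.
    messages = [(p, d, p) for p in sorted(interesting) for d in (+1, -1)]
    neighbors = {p: set() for p in interesting}
    midpoints = {}
    step = 0                                   # global step count
    while messages:                            # stop when nothing is in flight
        step += 1
        moved = [((pos + d) % n_cells, d, origin) for pos, d, origin in messages]
        arrivals = {}
        for pos, _, origin in moved:
            arrivals.setdefault(pos, []).append(origin)
        for pos, origins in arrivals.items():
            if pos not in interesting and len(origins) == 2:
                midpoints[pos] = (tuple(sorted(origins)), step)   # collision
        messages = []
        for pos, d, origin in moved:
            if pos in interesting:
                neighbors[pos].add((origin, step))   # message stops here
            else:
                messages.append((pos, d, origin))    # keep travelling
    return neighbors, midpoints


neighbors, midpoints = pass_messages(12, {0, 6})
```

In this model a collision is only tagged when a segment has an even number of steps, so that the two labels land on the same cell; the step count recorded at a vertex is the contour length of the segment that the message travelled.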

4.5  REGION  EXTRACTION 


Region  extraction  determines  connected  sets  of 
pixels  with  identical  values  in  their  segmentation 
property  field  and  particular  spatial  properties  of 
these connected sets (area, perimeter length,
minimum bounding rectangle). The connected
components operation is done by locally propagating
labels in parallel from points in regions. The
labels we use are the processor coordinates of the
region points, with conflicts resolved by letting
the label corresponding to the least row, col
component dominate. This operation would take
about 10 milliseconds for all regions less than 128
pixels in diameter, with the processing occurring
simultaneously over all regions [WEE84].
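The label propagation can be sketched as a synchronous iteration in which every cell repeatedly takes the smallest label among itself and its same-valued neighbors. Two choices here are assumptions: neighbors are the four edge-adjacent cells, and "least row, col" is read as lexicographic order on (row, col).

```python
def connected_components(prop):
    """Label each cell with the least (row, col) coordinate in its region."""
    h, w = len(prop), len(prop[0])
    label = [[(r, c) for c in range(w)] for r in range(h)]   # own coordinates
    changed = True
    while changed:
        changed = False
        new = [row[:] for row in label]       # all cells update in unison
        for r in range(h):
            for c in range(w):
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < h and 0 <= cc < w
                            and prop[rr][cc] == prop[r][c]
                            and label[rr][cc] < new[r][c]):
                        new[r][c] = label[rr][cc]   # smaller label dominates
                        changed = True
        label = new
    return label


# Hypothetical 3 x 3 region property field with two regions.
field = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
labels = connected_components(field)
```

The cell whose own coordinates equal its final label is the one the text calls the Region Cell. The number of passes grows with the longest path within a region, which is why the timing quoted above is stated in terms of region diameter.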

4.6  QUERIES 

Once  a  symbolic  intermediate  level 
representation  has  been  formed,  high  level 
processing can be initiated. For example, object
hypotheses can be formed via rule-oriented
techniques [WEY83,84], [RIS84]. This could involve
broadcasting,  sequentially,  the  range  of  a  set  of 
expected  feature  values  for  a  particular  object  and 
setting  weighted  responses  for  regions  which  match 
these  feature  values.  While  the  broadcasting  of 
object  features  proceeds  sequentially,  each  will 
take  place  very  quickly  so  that  the  strongest 
responders  for  several  objects  can  be  found  in  a 
single  frame  time. 

Some  simple  examples  of  these  queries  could  be 
that  large,  blue,  regions  in  the  upper  portion  of 
the  image  with  low  texture  might  be  a  sky  segment. 
If  candidate  regions  were  located  with  a  reasonable 
degree  of  confidence,  we  might  hypothesize  that  the 
region  below  sky  with  long  straight  edges  is  a 
roof;  while  green  segments  with  high  texture  of 
random  line  orientation  could  be  foliage.  Of 
course,  these  simple  queries  should  not  be  viewed 
as  a  solution  to  the  interpretation  problem. 
Rather,  these  types  of  operations  are  available  to 
a  high  level  knowledge-based  and  rule-oriented 
control  structure  as  a  focus-of-attention  mechanism 
to  allow  it  to  quickly  form  object  hypotheses  which 
can  then  be  processed  more  carefully.  Spatial 
relations  between  objects  can  be  tested  and 
particular  regions  can  be  examined  in  more  detail 
during  verification. 

A  basic  type  of  query  is  to  determine  a  set  of 
lines  having  some  attribute  in  common.  Once 
extracted,  these  can  be  used  to  initialize  and  then 
constrain an interpretation. For example, the set
of long straight lines of some range of contrast
is significant, especially in interpreting images
of man-made objects. A sequence of CAAPP
operations to do this is:


1)  Apply  a  simple  contrast  measure  at  points  along 
a  region  boundary,  such  as  the  absolute  value 
difference  between  image  pixels  on  opposite  sides 
of  the  region  boundary. 

2) Threshold this measure to remove low contrast
points. Tag the positions that remain.

3) Assuming that the contour has been oriented (see
section 4.4) and the appropriate directions stored
at each boundary position, walk from each vertex
point in the oriented direction, keeping count of
the number of tagged bits set in step 2 that are
encountered. Incrementing stops and the count is
stored when another interesting point is
encountered.

4)  The  contrast  along  the  segment  is  a  ratio  of  the 
number  of  tagged  cells  encountered  and  the  segment 
length. 


For  example,  Figure  19a  shows  the  linear  segments 
which  have  exceeded  a  minimal  contrast  threshold. 
Figure  19b  shows  a  further  threshold  on  contour 
length. 

Figure 19a

Figure 19b

Another important task is combining region and
line elements in the intermediate representation
into particular geometric figures. This can be
done by actually broadcasting the figure and
checking segment attributes in the corresponding
areas. A simple example of this is merging
straight lines. Initially, some line segment is
selected based upon attributes such as its length,
position, contrast, or the current state of the
determination of image relationships. From this,
the equation of the line is extracted and the cells
in the CAAPP which are within some distance of this
line are tagged. This is essentially masking an
area in the CAAPP for further processing and
involves tagging the set of processors which have
coordinates satisfying a set of broadcast
geometrical relations. Next, any line segments in
this area are tagged. Of those which are
extracted, the ones with the correct orientation
relative to the initial extracted line are found.
These may then be linked together, conditional on
other attributes (similar region adjacencies or
contrast). Simple shape analysis can also proceed
sequentially once a line set has been extracted and
a particular line selected. The area surrounding
its endpoints is tagged, any other lines from the
line set occurring in these areas are tagged, and
each is checked for a geometrical relation
consistent with a particular object. If one is
found, the processing continues; otherwise another
segment is selected.
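The masking step, in which cells test their own coordinates against a broadcast line equation, can be sketched as follows. The implicit line form and the function name are choices made for this example; note that the test is against the infinite line through the two endpoints rather than the segment itself.

```python
import math


def tag_cells_near_line(height, width, p0, p1, max_dist):
    """Tag every cell whose (row, col) lies within max_dist of the line p0-p1."""
    (r0, c0), (r1, c1) = p0, p1
    # Implicit form a*row + b*col + k = 0; these three values are "broadcast".
    a = c1 - c0
    b = -(r1 - r0)
    k = -(a * r0 + b * c0)
    norm = math.hypot(a, b)
    # Each cell evaluates the relation locally using only its own coordinates.
    return [
        [abs(a * r + b * c + k) / norm <= max_dist for c in range(width)]
        for r in range(height)
    ]


# Hypothetical 5 x 5 array and a horizontal line along row 2.
tags = tag_cells_near_line(5, 5, (2, 0), (2, 4), max_dist=1)
```

Because every cell applies the same test to its stored coordinates, the cost of the masking does not depend on how many cells end up tagged.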

A  basic  question  is  when  to  do  things  in 
parallel  in  the  CAAPP  via  local  message  passing  and 
when  to  use  the  global  broadcast  mechanism  to 
distribute  information  and  focus  processing  on 
particular  image  structures.  The  answer  to  this 
question  will  ultimately  depend  on  the  form 
high-level  vision  systems  take.  It  is  generally 
known that visual interpretation is difficult
unless  a  context  can  be  quickly  established, 
perhaps  by  heuristic  strategies  (as  in  the  use  of 
schemas  in  [WEY84]).  This  implies  that  much  of  the 
processing  is  highly  focused  and  prediction-driven. 
This  would  mean  highly  directed  queries  to  the 
CAAPP  after  some  initial,  rough  and  global 
operations  to  establish  a  context.  Additionally, 
the  need  for  segmentations  at  different  levels  of 
detail  and  over  multiple  attributes  is  obvious. 
For example, the line representation used by Burns
[BUR84] and the various segmentation procedures we
have developed can be consistently represented in
the RSV structure, and the effects of combining
them are only beginning to be explored.


5.0  CONCLUSIONS 


The  architecture  presented  here  is  novel, 
buildable,  and  opens  up  a  whole  range  of  very  high 
speed  computation  at  the  more  difficult  levels  of 
machine  vision.  Both  characteristics  of  the  CAAPP, 
array  processing  and  associative  processing,  seem 
to  be  fundamental  to  parallel  vision  algorithms. 
These  capabilities  allow  communication  which 
supports  both  global  and  local  processing,  which  in 
turn  supports  transformations  between 
representations  of  the  image  in  a  highly  flexible 
manner.
ABSTRACT

A preliminary survey was conducted on commercially
available array processors. Array processors from each major
manufacturer were chosen for comparison. The computation
powers of these machines range from one mega floating point
operation per second (Mflop) to a hundred Mflops. Prices range
from a few thousand dollars for a single board device to a quarter
million dollars for a 100 Mflops machine. A comparison of price,
performance, hardware architecture, software availability and
input/output interface is summarized. Users' comments on most
of the machines are included.

INTRODUCTION 

High  performance  array  processors  are  designed  to  attach 
to  a  general  purpose  host  computer.  The  combination  of  a  host 
computer  and  an  array  processor  can  sometimes  provide  cost 
effective scientific computations. For a minicomputer or a
superminicomputer  host,  processing  speeds  common  only  in  the 
world  of  Cray  and  Cyber  machines  can  sometimes  be  achieved. 
Although this performance is restricted to repetitive arithmetic
functions, the array processors come at prices much less than
those of general purpose computers.

In evaluating array processors, two considerations are
important. First, quoted performance is usually exaggerated and
achievable only for special computations, typically the fast Fourier
transform (FFT). Quoted operations are not multiply-adds, but
multiplies plus adds. For the FFT, there are two adds per multiply.
Thus 15 Mflops means 5 million multiplies per second for a machine
targeted for the FFT, with one multiplier and two adders. Such a
machine would reach only 10 Mflops for convolution, which needs
only one add per multiply. Second, programming array processors
typically requires one or two orders of magnitude more time than
programming general purpose computers, because of the lack of
software development tools and because of the hardware
particularities of array processors. Software environments are a
key issue in selection. We hope to provide more guidance based on
our experience with programming one or more machines in later
versions of this document.
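
The flop accounting in the first consideration can be sketched in a few lines of Python. This is only an illustration: the function name and the simple model (every arithmetic unit completes operations at the same rate, and the multiply rate is capped by whichever unit type saturates first) are assumptions made for the example, not a vendor formula.

```python
def achieved_mflops(unit_rate, multipliers, adders, adds_per_multiply):
    """Multiplies plus adds completed per second, in millions.

    unit_rate is the rate of one arithmetic unit in millions of
    operations per second; adds_per_multiply must be positive.
    """
    # The multiply rate is capped by whichever unit type saturates first.
    multiply_rate = min(multipliers * unit_rate,
                        adders * unit_rate / adds_per_multiply)
    add_rate = multiply_rate * adds_per_multiply
    return multiply_rate + add_rate


# One multiplier and two adders, each at 5 million operations per second.
fft = achieved_mflops(unit_rate=5, multipliers=1, adders=2, adds_per_multiply=2)
convolution = achieved_mflops(unit_rate=5, multipliers=1, adders=2, adds_per_multiply=1)
print(fft, convolution)
```

Under these assumptions the FFT keeps all three units busy and reaches the quoted 15 Mflops, while convolution leaves one adder idle and reaches 10 Mflops.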

We  have  not  considered  64  bit  floating  point  operations. 
They  are  important  for  many  applications  but  not  for  vision,  in  our 
opinion. 

Mainframes  and  microcomputers  are  general  purpose 
machines  designed  to  perform  general  purpose  tasks.  Array 
processors,  on  the  other  hand,  are  designed  to  handle  massive, 
complex  and  repetitive  calculations.  The  essence  of  array 
processors  is  streamlining  -  breaking  the  operation  into  small 
steps,  conducting  simultaneous  operations. 

Synchronous,  parallel,  pipelined  architecture  is  the 
mainstream  design  of  array  processors.  The  best  representatives 
of  the  genre  are  32  bit  machines  that  perform  full  floating  point 
calculations, though Floating Point Systems, which accounts
for 70% of the array processor market, uses a 38 bit
configuration as its standard. Most array processors are designed
to optimize processing speed for the FFT, which is the most
popular application of array processors. Some manufacturers
choose to use fixed/block floating point processors to gain extra
processing speed.

Most array processors are designed to work independently
of the host once the program and data are downloaded, thus
freeing the host for other purposes. Format conversion of data is
generally implemented on the host interface board. The ST-100
does the format conversion internally so as to maintain
its 100M byte/sec input/output data transfer rate.

Most of the array processors come with a high level
programming language which is compatible with Fortran 77.
These high level languages incur some inefficiency, perhaps a
factor of two relative to assembly code. They usually include a
library of pre-coded macros to facilitate process development and
provide a macro assembly language for users desiring to code their
own macros. In addition, there are simulation and debugger
facilities for macro and process debugging, plus a host computer
interface which allows the array processor to be moved from one
host computer to another without reprogramming.

Different  application  software  packages  are  available  for 
each  machine.  Most  of  them  include  standard  mathematics, 
signal  processing,  image  processing,  geophysical  processing  and 
simulation  libraries. 

STAR  TECHNOLOGIES  INC. 

The ST-100 is an array processor from Star Technologies Inc. It
uses a synchronous, parallel, pipelined architecture that
employs multiple processors and a hierarchical high
capacity data memory system. Bipolar VLSI circuits are used as
the building blocks. The machine cycle is 40 ns; it can achieve up
to 100 Mflops, which makes it the fastest comparable machine on the
market and almost twice as fast as the second fastest machine
in this survey.

Its multiple processor design separates the more general
purpose computer code, which executes in a control processor
consisting of two Motorola 68000 microprocessors, from the
specialized microprocessor code, which executes in the special
purpose processors. It has an MIMD (multiple instruction, multiple
data) architecture. Its multilevel program structure and
hierarchical memory structure enable efficient usage of both host
computer and array processor resources. They allow the array
processor to perform a hierarchy of concurrent arithmetic and data
movement operations independently, without the need for host
computer intervention.

Within  the  arithmetic  processor,  there  are  two  add/subtract 
units,  two  multiply  units  and  a  divide/square  root  unit.  The  adders 
and  multipliers  are  three  step  pipelines,  with  each  step  executing 
at  the  clock  rate  of  40  ns.  The  divide/square  root  section  is  non- 
pipelined  and  requires  13  clock  periods  to  compute.  When  the 
adders  and  multipliers  are  all  in  full  operation,  for  example  in 
convolution,  the  machine  can  operate  at  100  Mflops. 
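
The figures in this paragraph are consistent with simple cycle arithmetic, sketched below in Python. The function names are hypothetical and the model is an assumption: each pipelined unit delivers one result per clock once its pipeline is full, and the non-pipelined unit delivers one result per complete pass.

```python
def peak_mflops(pipelined_units, cycle_ns):
    """Peak rate when every pipelined unit delivers one result per clock."""
    return pipelined_units * 1000.0 / cycle_ns


def pipeline_latency_ns(stages, cycle_ns):
    """Time for a single operand pair to traverse the whole pipeline."""
    return stages * cycle_ns


def unpipelined_rate_mops(clocks_per_result, cycle_ns):
    """Result rate of a unit that must finish one result before starting the next."""
    return 1000.0 / (clocks_per_result * cycle_ns)


# Two adders plus two multipliers, 40 ns clock, three-step pipelines,
# and a 13-clock divide/square root unit, as quoted in the text.
st100_peak = peak_mflops(pipelined_units=4, cycle_ns=40.0)
add_latency = pipeline_latency_ns(stages=3, cycle_ns=40.0)
divide_rate = unpipelined_rate_mops(clocks_per_result=13, cycle_ns=40.0)
print(st100_peak, add_latency, round(divide_rate, 2))
```

With these inputs the four pipelined units account for the 100 Mflops peak, a single add or multiply takes 120 ns to emerge, and the divide/square root unit is limited to roughly 1.9 million results per second.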


To provide very high memory bandwidth, two features of
concurrency were introduced in the ST-100 memory. The first feature
is a very wide access path and the second is a highly interleaved
memory. These give an aggregate data rate of up to 100M
byte/second.

Another significant feature of the ST-100 is that it can attach to
and service multiple host computers concurrently. This is made
possible by its production software, which allows concurrent
staging and execution of processes, enabling subsequent
processes to be loaded and readied for execution while the
current process executes. Its input/output subsystem is
expandable to include independent input/output processors for
each host system. Its maintenance software provides off-line and
remote diagnostic capability.

The ST-100 software system consists of three major
segments. The Development Software System is a set of Fortran
programs residing in the host. It provides the ability to separately
program all elements of the ST-100, allowing multiple levels of
program optimization. These programs can be used to create
application library modules. The simulator supports simulation at
both the Fortran and microcode levels, while the debugger
debugs microcode only. The Production Software System
couples the array processor to the host application program. It
manages the allocation of array processor resources and directs
requests from the user's application program to the control
processor. It also schedules and controls requests from both
multiple users and multiple hosts. In addition, the ST-100 provides a
Maintenance Software System which includes a set of fault
identification and isolation routines for diagnostic purposes.

The VAST software tool (the Vector and Array Syntax
Translator) is designed to analyze DO loops in standard Fortran
programs and convert those loops for which vectorization is
possible into array operations. VAST creates a listing of the input
program with diagnostic comments added to tell the user which
loops are not vectorized and why. It also creates an enhanced
version of the input program which includes ST-100 processes in
place of the vectorized loops. Thus, the source remains
transportable and readable. Conversion and debugging efforts
are much reduced and made easier. The tool fills the gap between
the optimal efficiency of hand-coded operations and the total
transportability and maintainability of standard Fortran on the
host system.

Star Technologies has made large OEM sales to CDC and
GE. It appears to be a stable company.

Basic Configurations

Base Price                : $250,000 (512K word memory, host
                            interface, development software
                            and standard library)
Additional Memory         : $51,200/512K word
Additional Host Interface : $10,000
Additional Host Software  : $12,600

Unix interface will be available four months after receiving
any order.

Users'  comments: 

The Fortran modules provided are easy to program.
However, setting up a module requires several hundred cycles, so
it is only worthwhile to call these modules if the vector has
several thousand elements. That is, to fully utilize the machine,
the problem to be solved must vectorize very well. As an
alternative, one can use macro programming.
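
The trade-off between setup cost and vector length can be made concrete with a small Python sketch. The numbers used here are assumptions chosen for illustration: a setup cost of 300 cycles stands in for "several hundred", and one cycle per element stands in for a fully pipelined vector operation. The function names are hypothetical.

```python
import math


def module_efficiency(vector_length, setup_cycles, cycles_per_element=1.0):
    """Fraction of total cycles spent on useful element processing."""
    useful = vector_length * cycles_per_element
    return useful / (setup_cycles + useful)


def break_even_length(setup_cycles, target_efficiency, cycles_per_element=1.0):
    """Shortest vector whose efficiency reaches the target.

    Solves n*c / (s + n*c) >= t for n, giving n >= t*s / ((1 - t)*c).
    """
    n = target_efficiency * setup_cycles / ((1.0 - target_efficiency) * cycles_per_element)
    return math.ceil(n)


short_call = module_efficiency(vector_length=100, setup_cycles=300)
long_call = module_efficiency(vector_length=3000, setup_cycles=300)
needed = break_even_length(setup_cycles=300, target_efficiency=0.9)
print(short_call, round(long_call, 3), needed)
```

Under these assumed numbers a 100-element call spends three quarters of its time in setup, while a vector of a few thousand elements is needed before setup falls below 10% of the total, in line with the users' observation.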

It is more tedious to program the ST-100 than other
array processors. One has to keep track of how the data are
arranged in the cache memory, and there are two separate
processors to be considered, the Arithmetic Control Processor
and the Storage Move Processor. Typical speed is 25 Mflops if the
problem being solved has only one DO loop. The bottleneck
which one user experienced is at the interface between the cache
memory and the arithmetic section. Since the cache memory only
allows two reads and one write per cycle, in some applications only
one out of four of the arithmetic units can be running in each
cycle. Also, if the host is a VAX, one may anticipate another
bottleneck at the 1M byte interface between the host and the ST-100.
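
The cache-port bottleneck described by this user can be illustrated with a simple bound, sketched here in Python. This is an assumed model, not a description of the ST-100 scheduler: sustained throughput is taken to be limited by the scarcest of the arithmetic units, the read ports and the write port, and the per-operation traffic figures are hypothetical inputs.

```python
def port_limited_mflops(units, reads_per_flop, writes_per_flop,
                        cycle_ns=40.0, read_ports=2, write_ports=1):
    """Upper bound on sustained Mflops given cache traffic per operation."""
    limits = [float(units)]
    if reads_per_flop > 0:
        limits.append(read_ports / reads_per_flop)
    if writes_per_flop > 0:
        limits.append(write_ports / writes_per_flop)
    flops_per_cycle = min(limits)
    return flops_per_cycle * 1000.0 / cycle_ns


# Element-wise vector add: each result reads two operands and writes one.
vector_add = port_limited_mflops(units=4, reads_per_flop=2, writes_per_flop=1)
# Operands held in registers: no cache traffic, so all four units can run.
register_bound = port_limited_mflops(units=4, reads_per_flop=0, writes_per_flop=0)
print(vector_add, register_bound)
```

For a loop that reads two operands and writes one result per operation, the bound is one operation per 40 ns cycle, or 25 Mflops, which matches the typical speed the users report for a single DO loop.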

The manuals are good and extensive. They cover fine
details and are aimed at an audience with no previous knowledge of
array processors. A software simulation package can be run on the
host machine for software development. One test program on
simulation of equations of motion indicates the speed ratio among
ST-100 : MARS-432 : FPS-120B is 14 : 57 : 96. The multi-user
facility is only available in the later version, and it works on one
job at a time instead of time sharing between jobs. The debugger
is not available yet. STAR offers good service and responds
quickly. They are very conscientious and helpful with problems
encountered by users. The users are in general very pleased with
the machine.


FLOATING POINT SYSTEMS

The FPS-5000 series is the most recent series of array
processors introduced by Floating Point Systems. It consists
of the product groups 5100, 5200, 5300 and 5400, with peak
performance that ranges from 8 Mflops to 62 Mflops.

The family utilizes independent floating point processing
units, called Arithmetic Coprocessors. Data flow is simultaneously
managed by a combination of independent Input/Output
Processors and a central Control Processor. Arithmetic
Coprocessors can be added as field-installable upgrades.

The internal structure of the Arithmetic Coprocessor is
optimized for execution of the FFT, by virtue of the double memory
access cycles. Special butterfly addressing hardware directs
memory access during an FFT, in addition to simultaneous
pipelined operations of the multiplier and two adders. The
Arithmetic Coprocessor architecture includes internal memory,
ALU, DMA communication and diagnostic elements. Using these
parallel control and data path structures, 18 Mflops maximum
performance can be achieved with an instruction cycle time of 167
ns. A maximum of three Arithmetic Coprocessors can be added to
the FPS-5430 model. Adding the eight Mflops of computation power
from the Control Processor, a peak performance of 62 Mflops can
be attained. However, data transfer between coprocessors must
take place through the System Common Memory. Different data
formats are used for the Arithmetic Coprocessor and the Control
Processor. Thus efficiency and accuracy are reduced in
some applications.

A Multiple Array Execution Language (MAXL) is provided for
program development. It is a multiple processor control
language that is a subset of Fortran 77, and it generates code for
both the Control Processor and the Arithmetic Coprocessor.
AP-FORTRAN is another option that provides the capability of
developing array processor routines using standard Fortran IV
statements. AP-FORTRAN only executes on the Control
Processor. Also available for program development are the
assembler, simulator, linker, loader and debugger packages.

A programmable bit-slice interface processor (GPIOP)
supports real time input/output applications. It can be
programmed to interact with both the control requirements and
the data transfer protocols of most peripheral devices.
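
As a rough check of the FPS-5430 peak figures quoted in this section, the arithmetic can be written out in Python. The function names are hypothetical, and the assumption that the multiplier and the two adders each complete one operation per 167 ns instruction cycle is an interpretation of the text, not a statement from FPS.

```python
def coprocessor_peak_mflops(units=3, cycle_ns=167.0):
    """One multiplier and two adders, each finishing one operation per cycle."""
    return units * 1000.0 / cycle_ns


def system_peak_mflops(coprocessors, coprocessor_mflops=18.0, control_mflops=8.0):
    """Sum of the Arithmetic Coprocessor peaks and the Control Processor peak."""
    return coprocessors * coprocessor_mflops + control_mflops


per_coprocessor = coprocessor_peak_mflops()
fps_5430_peak = system_peak_mflops(coprocessors=3)
print(round(per_coprocessor, 2), fps_5430_peak)
```

Three operations every 167 ns come to just under 18 Mflops per coprocessor, and three coprocessors plus the 8 Mflops Control Processor give the quoted 62 Mflops. Like all peak figures, this assumes every unit is busy on every cycle.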


The FPS-5000 series maintains software compatibility with
previous FPS 38-bit processors, and is supported on a wide range
of host computers. Thus, the software support developed for
FPS-100 and FPS-120B products is maintained, and users are able
to move existing applications on the FPS-120B and FPS-100 onto the
FPS-5000. Floating Point Systems offers by far the most
complete software libraries. These include Standard and
Advanced Mathematics, Signal, Image and Geophysical
Processing and Simulation libraries.

The FPS-100 has been replaced by model FPS-5105. The
FPS-120B is the most popular array processor on the market. It
has been replaced by model FPS-5205.

Basic Configurations

Base Price (FPS-5430)     : $99,000 (4K word program memory,
                            256K System Common Memory, host
                            interface and development software)
Additional Program Memory : $4,900/128K byte
Additional Data Memory    : $7,800/256K word
Software Library          : ~$1,000 each

Unix interface is available from an outside source. It is not
supported by FPS.

Users' comments:

Input/output is fast through the GPIOP, but it uses linear
addressing. The maximum memory is 1M word, in a paged format
whose page boundaries cannot be crossed. The FPS-5205 is more
cost effective than the FPS-120B since it provides 256K memory at
less than half the price. Care has to be taken with the format
conversion between host, control processor and arithmetic
coprocessor (32-bit to 38-bit). One user complained that the
FPS-120B had a hardware problem in which the back panels did not
align properly and thus contacts were loose. Another user
commented that the hardware is stable, but a large amount of heat
is produced and an air-conditioned room is a must. A user of the
FPS-100 tried to split an old program to run half on the FPS-5205
control processor and half on the arithmetic coprocessor, and he
obtained twice the speed. FPS runs Fortran on the host and calls
the array processor by subroutine calls, and this causes a lot of
overhead.

An extensive manual is provided. Software subroutines are
useful and complete. The control processor runs all
subroutines from the FPS-120B or FPS-100 models, but the arithmetic
coprocessor only runs a subset of the subroutines. One user
said that FPS's debugger is useless and provides no information.
The service from FPS is good and they respond quickly to users'
problems. In general, the users interviewed are happy with the
machines.


NUMERIX CORPORATION

The MARS-432 array processor is the newest addition to the
product line of Numerix Corporation. The key elements of its
architecture are the Interface Processor (IP), the Data
Processor (DP), and a 32-bit wide Data Bus (DBUS).

The IP controls the transfer of information on the 20M
byte/sec DBUS. This bus is used to transfer both programs and
data to the DP. Data Memory (DM) and Program Memory (PM) are
included as part of the computational processor, the DP. This
implementation feature allows additional speed to be obtained in
arithmetic processing. If necessary, a data path between the two
does exist, thereby allowing the DM to be used as a bulk storage
area for large program memory.

The  primary  arithmetic  elements  are  a  multiplier  and  two 
ALUs.  Again,  this  is  optimized  for  the  execution  of  the  FFT.  All 
units  may  execute  in  parallel  and  all  are  interconnected  with 
multiple  data  paths.  Each  of  these  elements  is  commanded 
independently  and  may  initiate  an  operation  every  100  ns.  This 
produces an arithmetic execution rate of 30 Mflops. Its
throughput rates for vector add and vector multiply are higher
than those of the FPS-5430 because of this shorter cycle time.

The Fortran Development System provides high level
language access to the MARS-432. It consists of a Fortran compiler,
linker and trace/monitor. The off-line development package
includes the macroassembler, Loom and microcode debugger.
Loom is a utility that provides automatic microcode
optimization. It automates the writing of pipelined code at the
assembly language level and eliminates the need to hand code
pipelined instructions. The user interface to the MARS-432 at run
time is handled by AREX. AREX's transactions include array
processor initialization, input/output operation, and array function
execution. The MARS-432 offers application libraries for
mathematics, signal processing, and geophysical processing.

Basic Configurations

Base Price             : $120,000 (1M word memory, host
                         interface, Fortran compiler,
                         Fortran development system and
                         standard maths library)
Additional Memory      : $22,000/Mword
Extended Maths Library : $5,000 (includes image processing
                         routines)

Unix interface will be available if Numerix gets an order for it.

Users' comments:

The Array Processor Run time Executive (AREX) starts up
slowly; a long data stream has to be fed in to get a worthwhile run.
The machine is ten times faster than the FPS-120B when
the whole program is downloaded into the machine. The machine
has a large memory size and good integration of A/D and D/A
interfaces, but the input/output is slow. Its Fortran 77 compiler
has some bugs and is not fully developed yet, but they can be
worked around.

The MARS-432's manuals are primitive but complete; new manuals
are coming out every couple of months. The software library is not
as fully developed as that of the FPS-120B, but it is extremely
useful. It is easy to program the machine. After-sale service is
good; Numerix can use a modem to debug the machine without coming
to the site. A good 3 week training course is offered to teach the
users how to figure out the configuration according to their
needs. One user has had the machine for 6 months, and the down time
has exceeded the up time. The other two users interviewed are in
general happy with the machine.

CSPI 

The MAP 400 is the newest member of the MAP family of
32 bit floating point array processors. The MAP family uses a
multiprocessor, three-bus memory architecture to achieve high
speed and high throughput. Its execution is divided among
several specialized processors which operate in parallel. Though
a pipelined architecture is not implemented, its performance is
comparable to that of machines in the same category, and it better
fits certain kinds of algorithms which are inefficient on pipelined
architectures.

MAP 400 provides a choice of memory speeds (170 ns, 300
ns, 500 ns). Its capability can be added to an existing MAP 300
system. The hardware system contains two MAP 300 Arithmetic
Processing Units (APUs). MAP's multi-bus memory structure
provides parallel data transfer for the APUs during calculation
without interference. The data work buffers for the separate APUs
may be configured by the user on separate memory buses, thereby
enabling both arithmetic units to run in parallel at the full 12 Mflops.

The MAP 400 software system is an extension of the
MAP 200/300 system, consisting of all operational, utility, and
diagnostic software. The Operating System software consists of a
set of programs which enables the user to call the supplied library
subroutines. The calling programs, executed in the host, are
written in Fortran. In addition, it includes several commands for
executing a pair of function lists in parallel. The function lists may
contain identical functions so that both APUs can perform the
same task on two separate blocks of data at the same time.
Because tasks are carried out in parallel, it is necessary to assure
synchronization between processors. MAP uses a data driven
hardware structure with queues internal to the processors to
eliminate the need for concern with internal synchronization.

For  those  applications  requiring  direct  analog  input  or 
output,  MAP  provides  modules  which  handle  such  data  in  parallel 
with concurrent arithmetic. It also provides modules which allow
direct access to and from disk and bulk memory storage devices.
Input  and  output  buffers  for  the  two  units  may  reside  on  the  same 
memory  bus;  however,  the  work  buffers  must  be  defined  on 
separate  buses  to  approach  a  doubling  in  processing  speed. 

Basic Configurations

Base Price                : $54,850 (32K program memory)
Additional Data Memory    : $19,500/256K byte
Additional Program Memory : $4,000/4K word
Hardware Interface        : $3,500
Software System           : $3,500 (Executive and Standard
                            Library)

Mini-MAP is another fully programmable, 32-bit floating
point array processor from CSPI. It interfaces with DEC
computers. The basic system consists of 4 hex boards that plug
directly into the Unibus.

One of the major overheads of array processing is passing
data and commands between the host and the array processor.
Mini-MAP's memory is designed to be shared with the host.
Thus, the host CPU can process and move data in the Mini-MAP
memory in the same way it uses its own Unibus memory. In effect,
shared memory makes Mini-MAP a coprocessor.

The cycle time of the machine is 125 ns. Each addition requires
two cycles and each multiplication requires three cycles. It has a
peak performance of seven Mflops and a benchmark time of 7.8
ms for a 1024-point complex FFT. It has no local program memory.
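
The Mini-MAP timing figures can be cross-checked with a short Python sketch. Two assumptions are made that the text does not state: that one adder and one multiplier run concurrently without pipelining, and that the FFT is a radix-2 algorithm costing 4 real multiplies and 6 real adds per butterfly. The function names are hypothetical.

```python
def unit_rate_mops(cycles_per_op, cycle_ns=125.0):
    """Millions of operations per second for a non-pipelined unit."""
    return 1000.0 / (cycles_per_op * cycle_ns)


def radix2_fft_flops(n):
    """Real operations in a radix-2 complex FFT of length n (a power of two).

    Assumes 4 real multiplies and 6 real adds per butterfly.
    """
    stages = n.bit_length() - 1      # log2(n) when n is a power of two
    butterflies = (n // 2) * stages
    return butterflies * 10


add_rate = unit_rate_mops(cycles_per_op=2)        # adder: two cycles per add
multiply_rate = unit_rate_mops(cycles_per_op=3)   # multiplier: three cycles
peak = add_rate + multiply_rate

fft_flops = radix2_fft_flops(1024)
fft_mflops = fft_flops / 7.8e-3 / 1e6             # benchmark time of 7.8 ms
print(round(peak, 2), fft_flops, round(fft_mflops, 2))
```

Under these assumptions the combined rate is about 6.7 Mflops, consistent with the quoted peak of seven Mflops, and the 7.8 ms benchmark corresponds to roughly 6.6 Mflops, so the FFT runs close to the machine's peak.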

The Host Support Library is an extension to the user's Fortran
runtime library that is used for controlling Mini-MAP programs
from a host Fortran program. Application programs can be
developed in the MAP Control Language, which is a Fortran subset
language. A set of utility programs consisting of an assembler,
compiler, linker, debugger and board level diagnostics is also
provided.

Basic Configuration

Base Price        : $26,000 (>'4K byte memory)
Additional Memory : $6,000/1M byte

ANALOGIC  CORPORATION 

Analogic's latest array processor is the AP-500. Its
architecture combines a Motorola MC68000 based control
processor with an internal 40 bit full floating point pipelined
arithmetic logical unit. This provides better precision than the
other 32 bit or 30 bit processors. Externally, it conforms with the
DEC 32 bit full floating point format.

The pipelined ALU consists of an Arithmetic Pipeline, an
Address Generator and a Pipeline Sequencer. These three
elements operate in parallel with the Control Processor and
Input/Output, and fully utilize the AP Data Memory bandwidth.
Upon entry to the Arithmetic Pipeline, data are stored in the
Pipeline Register Files (PRFs). Depending upon the algorithm,
data can be moved between PRFs along the Bypass and
Feedback Paths of the multiplier and adder. Thus, data may
continually circulate within the Pipeline until final results are
computed. This eliminates the need for time consuming
"Pipeline/Data Memory" data exchanges.

The AP-500 offers modular subroutines instead of the more
common Fortran compiler for the users. These subroutines are
optimized to take full advantage of the AP-500's architecture.
Applications can be programmed in Host or AP-500 resident
High-Level or Assembly Languages. The AP Executive is available in
single or multi-tasking versions. It manages internal AP-500
activities and resources to unburden and offload the host.

Basic Configuration

Base Price         : $35,000 (0.5M word memory, host
                     interface, development software
                     and standard library)
                   : $47,000 (1.0M word memory, host
                     interface, development software
                     and standard library)
Assembler Software : $1,500

No Unix interface is available.

Users' comments:

It has a large memory of 0.5M word, twice that of the standard
FPS-5205. Data in memory can be accessed readily through the
buffer and no paging is required. The user can create his own
buffer size. Input/output speed is slow, only one third that
of the FPS-5205. It takes up little physical space and consumes much
less power than the FPS-5205.

The software is immature, but it is being frequently updated.
All the routines have to be preloaded before they are called. It is
easy to program despite the fact that the high level language is not
Fortran and there is no Fortran compiler. A service contract is
recommended on the West Coast since Analogic does not have
many service personnel out there.

MERCURY COMPUTER SYSTEMS

Mercury Computer Systems offers two array processors, the
ZIP-3216, which performs 16 bit and 32 bit fixed point or block
floating point operations, and the ZIP-3232, which will do 7 Mflops
full floating point when available in first quarter 1985. The
ZIP-3216 is the only machine in this survey which uses the block
floating point format. It is an interesting machine because of its
cost and because much vision work can use 16-bit computations.
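
For readers unfamiliar with block floating point, the following Python sketch shows the general idea: a whole block of values shares one exponent, and each value keeps only a fixed-width integer mantissa. This is a generic illustration, not the ZIP-3216's actual data format; the function names, the 16-bit signed mantissa width and the choice of exponent are all assumptions.

```python
import math


def bfp_encode(block, mantissa_bits=16):
    """Encode floats as signed integer mantissas with one shared exponent."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0] * len(block), 0
    limit = 2 ** (mantissa_bits - 1) - 1          # largest signed mantissa
    # Smallest exponent that lets the largest value fit in the mantissa.
    exponent = math.ceil(math.log2(peak / limit))
    scale = 2.0 ** exponent
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return mantissas, exponent


def bfp_decode(mantissas, exponent):
    """Recover approximate floats from mantissas and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


mantissas, exponent = bfp_encode([0.5, -0.25, 0.001])
decoded = bfp_decode(mantissas, exponent)
print(mantissas, exponent, decoded)
```

Because the exponent is set by the largest value in the block, small values in the same block retain fewer significant bits, which is the price paid for doing the arithmetic on fast fixed point hardware.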

The  dual  processors  operate  concurrently  and  are 
automatically  synchronized  in  hardware.  The  AMD  29116  based 
Control  Processor  controls  data  flows.  It  delivers  and  extracts  data 
of  virtually  any  format  to  and  from  the  Arithmetic  Pipeline. 

Within the ZIP architecture, the Arithmetic Pipeline, which is
a single board, is independent of all other system components. It
consists of a multiplier and a full ALU which provide a processing
rate of 20 million computations per second in the 16 bit mode and
5 million computations per second in the 32 bit mode.

The ZIP-3216 utilizes a C-like instruction language, ZIP/C, which
is augmented with special functions to control functional
hardware. One of the characteristics of the language is the writing
of arithmetic instructions. Output results can be directed to the
FIFO and/or any of the accumulators. The compiler supports
several product terms in the algebraic equation and the
assignment of symbolic names to any variable. The ZIP/C
compiler is able to distinguish between Control Processor and
Arithmetic Pipeline code contextually. The outputs are linked
appropriately to create an application task file.

The ZIP Simulator/Debugger provides timing analysis of
individual algorithms independently of the ZIP hardware. Its
debugging facilities include a multiple viewing window into
user-defined elements of the ZIP. The user can display any component
of the ZIP. Multiple components can be simultaneously displayed, and
user created display formats can be saved and recalled. The
Simulator/Debugger allows ZIP programs to be single stepped,
run until a user specified condition is true, or run to completion.
In addition, trace files can be generated. To determine algorithm
execution speed, the Simulator/Debugger generates timing
statistics as well as utilization efficiencies for different
components. The users can also simulate clock interrupts and
Dual Direct Access Channel input/output data flow.

The  algorithm  library  includes  input/output  and  arithmetic 
functions  for  signal,  image/graphic  and  scientific  processing. 
The  functions  can  be  called  within  the  user's  host  computer 
program  or  within  the  ZIP-3216  program. 

Basic Configuration

Base Price        : $18,000 (128K byte program)
Additional Memory : $3,000/512K byte
Software          : $2,000 (Development tools and run time
                    software)
Software Library  : $2,000-5,000 each

Unix interface is available.


Users' comments:

The Multibus version was first delivered in July, and the
users we talked to have had the machine for only a few weeks. They
have not had any hardware problems yet. The simulation package
for program development is nice and easy to work with. The
second and third drafts of the manuals are fairly complete. The
manuals are getting to the point where they are usable. Some
cosmetic improvement of the manuals is desirable.

MARINCO  COMPUTER  PRODUCT 

The Marinco APB-3000 is a single board array processor.
Its architecture allows direct plug-in compatibility with the IEEE
Multibus and is configurable to fit a 64K byte segment of memory.
The APB-3000 actually looks like a random access memory (RAM)
board to the host processor, and a high speed auxiliary bus may
be utilized for data transfer. It uses 24 bit full floating point and
16 bit integer arithmetic. This makes it less attractive than other
machines.

The APB-3000 uses a parallel 16 bit AMD29516
multiplier and a 16 bit microprocessor based AMD29116 ALU. It
executes instructions in 125 ns and is capable of up to 8 million
integer or 1 million floating point operations per second. Its
separate Program and Data Memories allow for parallel fetch
and execution cycles. Utilizing two buses allows parallel
operations such as inputting data to the ALU and loading the
multiplier from the Data Memory. Its single board design actually
cuts processing time by minimizing physical connection distance.

Data  Memory  is  organized  as  24-bit  words  internal  to  the 
machine.  When  32  bit  floating  point  data  are  moved  from  host 
memory, the lower 8 bits are dropped. In the reverse direction, 8
bits of zero are appended to the 24-bit memory word.
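
The effect of dropping the lower 8 bits can be seen in a small Python sketch. It uses the IEEE 754 single precision layout as a stand-in for the host's 32 bit format, which is an assumption: the APB-3000's own format is not IEEE, and the host format depends on the machine. The function names are hypothetical.

```python
import struct


def to_24_bit_word(x):
    """Pack x as a 32 bit float and drop the lower 8 bits (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 8


def from_24_bit_word(word24):
    """Append 8 zero bits to a 24 bit word and reinterpret it as a float."""
    bits = (word24 & 0xFFFFFF) << 8
    return struct.unpack(">f", struct.pack(">I", bits))[0]


exact = from_24_bit_word(to_24_bit_word(1.0))      # survives unchanged
truncated = from_24_bit_word(to_24_bit_word(0.1))  # loses low mantissa bits
print(exact, truncated, 0.1 - truncated)
```

Values whose low mantissa bits are already zero survive the round trip unchanged, while others are truncated toward zero. With this assumed layout, only 15 of the 23 stored mantissa bits remain, which is the loss of precision that users of the board mention.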

The APB-3000 is fully programmable and is available with an
optional set of microcode development software. Marinco's
hierarchical assembler (HIASM) enables microprogramming in a
high-level language. Marinco's microassembler (MARASM)
processes output from HIASM or direct user code, and produces
object code. Marinco's APB monitor/debugger enables host
based debugging of programs developed using HIASM or
MARASM. It provides direct access to all of the APB-3000's registers,
Data Memory, and Program Memory.

The operating microcode is available on flexible diskettes
for use when the board is supplied with RAM for program memory.
Alternatively, the board can be supplied with PROMs containing the
microcode needed to perform operations such as digital filtering
and FFT.


Base  Price  :  $4,250 
Assembler  :  $2,500 


Storing and retrieving data requires 1 ms, which amounts to
100% overhead when doing floating point operations. The
machine supports only addition and multiplication. The floating
point format is not IEEE standard, and its 24-bit representation
reduces precision. The program memory is small, at 4K words.
One user had trouble reading the software tape, and the board
did not work properly when installed on the Multibus of a SUN
workstation. Another user took a month and a half to track down
a hardware problem during first installation.

There is no software support, and standard subroutines such
as matrix-vector multiplication have to be programmed by the
user. The documentation is readable but definitely insufficient.
One user commented that he could only find his way by making
some correct guesses from the manual. Marinco has promised more
documentation. Microcoding the machine is easy. Marinco provides
good service and deals with problems immediately.

SKY  COMPUTER 

The Sky Micro Number Kruncher (SKYMNK) is a full 32-bit
floating point array processor designed for use with 16-bit
microcomputer systems. These array processors are easy to use
as plug-in modules. The SKYMNK is a pipelined, parallel
coprocessor that operates internally at speeds of up to one
Mflops. Its quoted speed is hard to achieve in practice since it
has no local memory. A floating point operation must transfer two
32-bit operands, or four 16-bit words, which takes about
4 microseconds at minimum. Designed as a tightly coupled
coprocessor, the SKYMNK operates directly on data residing
anywhere in host memory. Thus, separate Fortran calls to transfer
data to and from the array processor are not necessary. Also, its
DMA architecture eliminates the need for costly additional
memory. Overlap of DMA input/output with processing is automatic
and transparent to the user while maximizing throughput.
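
The gap between the quoted and achievable speed can be estimated
with a small calculation. The following is a rough sketch under a
simplifying assumption that is not taken from the vendor: each
floating point operation needs one compute step and one operand
transfer, and the two either overlap fully or not at all. Result
write-back is ignored, and the function name is hypothetical.

```python
def effective_mflops(compute_us: float, transfer_us: float, overlapped: bool) -> float:
    """Sustained rate in Mflops when every operation needs one transfer.

    compute_us  -- time for one floating point operation (microseconds)
    transfer_us -- time to move that operation's operands (microseconds)
    overlapped  -- True if the transfer fully overlaps computation
    """
    if overlapped:
        per_op_us = max(compute_us, transfer_us)
    else:
        per_op_us = compute_us + transfer_us
    return 1.0 / per_op_us  # operations per microsecond equals Mflops


if __name__ == "__main__":
    # Figures from the text: 1 Mflops peak (1 us per operation) and
    # about 4 us to move two 32-bit operands over a 16-bit path.
    print(effective_mflops(1.0, 4.0, overlapped=True))   # 0.25
    print(effective_mflops(1.0, 4.0, overlapped=False))  # 0.2
```

Under this model the operand traffic, not the arithmetic, sets the
sustained rate at roughly a quarter of the quoted peak.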

Each SKYMNK is supplied with software support including an
inline driver, subroutine library, software simulator, and test
diagnostics. The subroutine library contains routines for vector
mathematics and signal processing. They can be called from either
Fortran or Macro user programs. The software simulator allows
users to develop code for the SKYMNK without the hardware
actually installed.

Basic Configurations
Base Price :  $6,990
Software   :  $1,200  (Simulator, diagnostic and documentation)
           :  $4,000  (Assembly language source code)

CONCLUSION 

The ST 100 from Star Technologies is by far the fastest array
processor in this study. The FPS 5000 series from Floating
Point Systems has the most software support because of its
compatibility with previous models. The best price/performance
ratios (the lowest price per Mflops) are offered by the ZIP 3216
from Mercury and the FPS 5430 from Floating Point Systems. It is
likely that the introduction of the ZIP 3232 will lower the
price/performance ratio further.
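
The ranking can be checked directly from the base prices and
maximum speeds listed in the Table of Comparison. The short Python
sketch below recomputes the price/speed ratio for each machine; the
variable and function names are illustrative only, and the table
itself rounds the ratios to two significant figures.

```python
# (base price in dollars, maximum speed in Mflops), copied from the
# Table of Comparison.
machines = {
    "ST100": (250000, 100),
    "FPS5430": (99000, 62),
    "MARS432": (120000, 30),
    "MAP400": (55000, 24),
    "AP500": (35000, 9.4),
    "ZIP3216": (8000, 5),
    "APB3000": (4250, 1),
    "SKY": (5990, 1),
    "FPS100": (45000, 8),
    "FPS120B": (55000, 12),
}


def price_per_mflops(price: float, mflops: float) -> float:
    """Dollars paid per Mflops of peak speed (lower is better)."""
    return price / mflops


# Machine names ordered from best (lowest) to worst ratio.
ranked = sorted(machines, key=lambda name: price_per_mflops(*machines[name]))

if __name__ == "__main__":
    for name in ranked:
        price, speed = machines[name]
        print(f"{name:8s} {price_per_mflops(price, speed):7.0f} dollars per Mflops")
```

Because the ratio uses peak speed only, it says nothing about the
sustained rate on a particular algorithm.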


Most of the machines use a 32-bit floating point
representation; however, few of them conform to the IEEE floating
point standard. Few of the manufacturers support the VAX-Unix
system at present. However, it is likely that most of them will
provide full Unix support in the near future as the Unix operating
system becomes more popular.

Though the maximum speed in Mflops is a good indication of
the  performance  of  a  machine,  we  have  to  remember  that  for  most 
algorithms  this  maximum  speed  cannot  be  reached.  Some  array 
processors  try  to  boost  their  peak  performance  by  providing  more 
multiplier  and  adder  units.  This  increase  in  speed  is  partly  offset 
by  the  communication  time  required  between  computational  units. 
Finally,  the  ease  of  programming  is  another  important  factor  to  be 
considered.  This  usually  can  only  be  learned  through  actual 
experience with the machines.

Table of Comparison

                            ST100    FPS5430  MARS432  MAP400   AP500    ZIP3216  APB3000  SKY      FPS100   FPS120B

Configurations
Base Price(1) (dollar)      250000   99000    120000   55000    35000    8000     4250     5990     45000    55000
Max. Speed (Mflops)         100      62       30       24       9.4      5        1        1        8        12
Price/Speed Ratio           2500     1600     4000     2300     3700     1600     4250     5990     5600     4600
Float Point Format          Full     Full     Full     Full     Full     Block    Full     Full     Full     Full
Number of bits              32       38(2)    32       32       40       32       24       32       38(2)    38(2)
Integer Format              32       28(2)    32       32       32       32       16       -        28(2)    28(2)
Max. Program Memory (byte)  256k     128k     16k      224k     128k     10k      12k      host     32k      64k
Max. Data Memory (word)     8M       512k     512k     312k     912k     16M      16k      host     64k      512k

Hardware
Machine cycle (ns)          40       167      100      200      125      200      125      143      250      167
I/O channel (Mbyte/sec)     100      12       20       36       25       20       20       4        8        12

Arithmetic Unit
Adder/Subtract Unit         2        7        2        4        1        1        1        1        1        1
Multiplier Unit             2        4        1        4        1        1        1        1        1        1
Adder Cycle                 3        5        5        1(3)     1(4)     1        -        -        2        2
Multiplier Cycle            3        5        5        2(3)     2(4)     1        -        -        3        3

Software
High level language         FORTRAN  FORTRAN  FORTRAN  FORTRAN  Yes      C        FORTRAN  FORTRAN  FORTRAN  FORTRAN
Macro Assembler             Yes      Yes      Yes      Yes      Yes      -        Yes      Yes      Yes      Yes
Simulator                   Yes      Yes      -        Yes      -        Yes      -        Yes      Yes      Yes
Debugger (high level)       [8 of 10 entries survive, in source order: -, -, Yes, Yes, -, -, Yes, Yes]
Debugger (low level)        [8 of 10 entries survive, in source order: Yes, Yes, Yes, -, -, -, Yes, Yes]
Linker                      [9 of 10 entries survive, in source order: Yes, Yes, Yes, -, Yes, Yes, -, -, -]
Library                     Yes      Yes      Yes      Yes      Yes      Yes      Yes      Yes      Yes      Yes
Math                        -        Yes      Yes      -        -        Yes      -        -        Yes      Yes
Signal Proc.                -        Yes      Yes      -        -        Yes      -        -        Yes      Yes
Image Proc.                 -        Yes      Yes      -        -        Yes      -        -        Yes      Yes
Geophysical                 -        Yes      Yes      -        -        -        -        -        Yes      Yes
Simulation                  -        Yes      Yes      -        -        -        -        -        Yes      Yes

Table of Comparison (Cont.)

                            ST100    FPS5430  MARS432  MAP400   AP500    ZIP3216  APB3000  SKY      FPS100   FPS120B

Performance
Vector Add (us)             0.04     0.25     0.2      0.3      0.49     0.4      -        1.1      -        -
Vector Multiply (us)        0.04     0.25     0.2      0.3      0.49     0.4      -        1.0      -        -
Vector Divide (us)          [8 of 10 entries survive, in source order: 0.56, 1.92, 0.5, 0.8, 1.12, 14.0, -, -]
Complex 1024 pt. FFT (ms)   0.864    2.56     1.7      2.7      4.7      13.0     4.5      53.1     -        -
Complex 2D FFT 512X512 (ms) [7 of 10 entries survive, in source order: 400.0, 500.0, 4200, -, -, 5100, 3400]
Matrix Multiply 100X100 (ms) [8 of 10 entries survive, in source order: 44.4, 71.0, 365.0, 800.0, -, -, 660, 439]
Convolution 128 by 8 (ms)   [9 of 10 entries survive, in source order: 0.044, -, 0.168, -, -, -, -, -, -]
Convolution 1024 by 32 (ms) [3 of 10 entries survive, in source order: 0.782, 5.803, 9.9]

I/O Interface
VAX-UNIX(5)                 [9 of 10 entries survive, in source order: -, Yes, -, -, Yes, -, -, Yes, Yes]
VAX-VMS(DEC)                Yes      Yes      Yes      Yes      Yes      Yes      -        -        Yes      Yes
PDP-11-RSX-11M              -        Yes      Yes      Yes      Yes      Yes      -        Yes      Yes      Yes
HP 1000                     -        Yes      -        Yes      Yes      -        -        -        Yes      Yes
Eclipse(DGC)                [9 of 10 entries survive, in source order: -, Yes, -, Yes, Yes, -, -, -, Yes]
PE 3200                     Yes      Yes      -        Yes      -        -        -        -        Yes      Yes
SEL(GOULD)                  [9 of 10 entries survive, in source order: Yes, Yes, -, -, -, -, -, Yes, Yes]
Harris                      -        Yes      -        -        -        -        -        -        Yes      Yes
Prime                       -        Yes      -        -        -        -        -        -        Yes      Yes
IBM4331,3081                [9 of 10 entries survive, in source order: Yes, Yes, -, -, -, -, -, -, Yes]
Multibus                    -        -        -        -        Yes      Yes      Yes      Yes      -        -
Unibus                      Yes      Yes      Yes      Yes      Yes      -        -        Yes      Yes      -
Q-bus                       -        -        -        -        Yes      Yes      -        Yes      -        -
Versabus                    -        -        -        -        -        Yes      -        Yes      -        -
RS232                       -        -        -        -        Yes      -        -        -        -        -
IBM-PC                      -        -        -        -        -        Yes      Yes      -        -        -

Notes

(1)  detailed configurations are given in the paper
(2)  FPS5XXX Arithmetic Coprocessor uses 32-bit floating point and 24-bit integer
(3)  one cycle is 240 ns
(4)  one cycle is 160 ns
(5)  full descriptions are given in the paper


